From: "Nicolas Spiegelberg (JIRA)"
To: issues@hbase.apache.org
Date: Sun, 3 Oct 2010 18:43:33 -0400 (EDT)
Subject: [jira] Commented: (HBASE-3043) 'hbase-daemon.sh stop regionserver' should kill compactions that are in progress

    [ https://issues.apache.org/jira/browse/HBASE-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917423#action_12917423 ]

Nicolas Spiegelberg commented on HBASE-3043:
--------------------------------------------

Pranav's comments:

1) On WriteState variable privacy: 6 of one, half-dozen of the other. I made sure the WriteState variable was package private. I was looking at possibly adding some more unit tests dealing with our write state, so I didn't want to write a bunch of accessors just for unit tests. In the unit test case, we don't really need to worry about synchronization either. My thought was to add accessor methods if we're going to use it outside of a unit test. Okay?

2) The lack of unlock() actually could have caused some extremely rare deadlock conditions, but only on exit, so probably no one has run across it. I mainly wanted to fix poor practice.

Stack's comment:

Your thought is correct. However, I do need to make a small change that I had made internally but lost when I refactored. This works because of some subtle interactions between server.stopRequested(), CompactSplitThread.lock, & HRegion.writeState.writesEnabled. States that can happen:

1) We get the lock & interrupt compactionQueue.poll(). It throws an InterruptedException, which calls continue, which fails the next while() check, which finishes the close.

2) We get the lock & interrupt, but the thread is somewhere between the poll() and the lock(). [In new patch] CompactSplitThread.run() queries stopRequested() immediately after getting the lock(), which skips the compact/split code to return to the while() check and ...

3) We don't get the lock. HRegionServer.run() calls closeAllRegions(), which calls HRegion.close(), which sets the writeState. The compaction sees this and throws an InterruptedIOE, which aborts the current compaction, goes to the while() check in CompactSplitThread.run() and ...
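
To make the three states easier to follow, here is a rough, self-contained sketch of the interaction. It is not the patch: Worker, requestStop(), compact(), and the step-count queue are hypothetical stand-ins for CompactSplitThread, the region server's stop path, the real compaction, and compactionQueue, and the two volatile flags only approximate server.stopRequested() and writeState.writesEnabled.

```java
import java.io.InterruptedIOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class CompactionStopSketch {

    /** Hypothetical stand-in for the compaction thread; not the real CompactSplitThread. */
    static class Worker extends Thread {
        final BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        final ReentrantLock lock = new ReentrantLock();
        volatile boolean stopRequested = false; // assumed analogue of server.stopRequested()
        volatile boolean writesEnabled = true;  // assumed analogue of writeState.writesEnabled
        final AtomicInteger completed = new AtomicInteger();
        final AtomicInteger aborted = new AtomicInteger();

        @Override
        public void run() {
            while (!stopRequested) {
                Integer steps;
                try {
                    steps = queue.poll(100, TimeUnit.MILLISECONDS);
                } catch (InterruptedException e) {
                    continue; // state 1: interrupted in poll(), re-check the while() condition
                }
                if (steps == null) {
                    continue; // nothing queued yet
                }
                lock.lock();
                try {
                    if (stopRequested) {
                        continue; // state 2: stop arrived between poll() and lock()
                    }
                    compact(steps);
                    completed.incrementAndGet();
                } catch (InterruptedIOException e) {
                    aborted.incrementAndGet(); // state 3: compaction aborted mid-run
                } finally {
                    lock.unlock(); // always released, including on the continue above
                }
            }
        }

        /** Fake compaction: checks the write flag before each 10 ms unit of work. */
        private void compact(int steps) throws InterruptedIOException {
            for (int i = 0; i < steps; i++) {
                if (!writesEnabled) {
                    throw new InterruptedIOException("writes disabled during compaction");
                }
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    throw new InterruptedIOException("interrupted during compaction");
                }
            }
        }

        /** Stop path: interrupt only if the lock is free, then disable writes and wait. */
        void requestStop() {
            stopRequested = true;
            if (lock.tryLock()) {
                try {
                    interrupt(); // states 1 and 2: the worker is not compacting
                } finally {
                    lock.unlock();
                }
            }
            writesEnabled = false; // state 3: stands in for HRegion.close() setting writeState
            try {
                join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Queues a roughly 10 s fake compaction, stops the worker once it holds the lock. */
    static int[] stopDuringCompaction() {
        Worker w = new Worker();
        w.start();
        w.queue.add(1000);
        while (!w.lock.isLocked()) {
            Thread.yield();
        }
        w.requestStop();
        return new int[] { w.completed.get(), w.aborted.get() };
    }

    public static void main(String[] args) {
        int[] r = stopDuringCompaction();
        System.out.println("completed=" + r[0] + " aborted=" + r[1]);
    }
}
```

In this sketch the stop request should return well before the fake compaction would have finished, with no completed compaction. Whether the worker exits through state 2 or state 3 depends on timing, which is the same ambiguity the list above describes.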

> 'hbase-daemon.sh stop regionserver' should kill compactions that are in progress
> --------------------------------------------------------------------------------
>
>                 Key: HBASE-3043
>                 URL: https://issues.apache.org/jira/browse/HBASE-3043
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.89.20100621, 0.90.0
>            Reporter: Nicolas Spiegelberg
>            Assignee: Nicolas Spiegelberg
>             Fix For: 0.89.20100924, 0.90.0
>
>         Attachments: HBASE-3043_0.89.patch, HBASE-3043_0.90.patch
>
>
> During rolling restarts, we'll occasionally get into a situation with our 100-node cluster where a RS stop takes 5-10 minutes. The problem is that the RS is undergoing a compaction and won't stop until it is complete. In a stop situation, it would be preferable to preempt the compaction, delete the newly-created compaction file, and try again once the cluster is restarted.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.