Date: Wed, 16 Nov 2011 06:10:51 +0000 (UTC)
From: "Hudson (Commented) (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-4792) SplitRegionHandler doesn't care if it deletes the znode or not, leaves the parent region stuck offline

[ 
https://issues.apache.org/jira/browse/HBASE-4792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151047#comment-13151047 ]

Hudson commented on HBASE-4792:
-------------------------------

Integrated in HBase-TRUNK #2445 (See [https://builds.apache.org/job/HBase-TRUNK/2445/])
    Adding the 0.92 entry for HBASE-4792
HBASE-4792 SplitRegionHandler doesn't care if it deletes the znode or not, leaves the parent region stuck offline

jdcryans :
Files :
* /hbase/trunk/CHANGES.txt

jdcryans :
Files :
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/SplitRegionHandler.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java

> SplitRegionHandler doesn't care if it deletes the znode or not, leaves the parent region stuck offline
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4792
>                 URL: https://issues.apache.org/jira/browse/HBASE-4792
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: HBASE-4792-0.92.patch
>
>
> Saw this on a little test cluster, really easy to trigger.
> First the master log:
> {quote}
> 2011-11-15 22:28:57,900 DEBUG org.apache.hadoop.hbase.master.handler.SplitRegionHandler: Handling SPLIT event for e5be6551c8584a6a1065466e520faf4e; deleting node
> 2011-11-15 22:28:57,900 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x132f043bbde08c1 Deleting existing unassigned node for e5be6551c8584a6a1065466e520faf4e that is in expected state RS_ZK_REGION_SPLIT
> 2011-11-15 22:28:57,975 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x132f043bbde08c1 Attempting to delete unassigned node in RS_ZK_REGION_SPLIT state but after verifying state, we got a version mismatch
> 2011-11-15 22:28:57,975 INFO org.apache.hadoop.hbase.master.handler.SplitRegionHandler: Handled SPLIT report); parent=TestTable,0001355346,1321396080924.e5be6551c8584a6a1065466e520faf4e. daughter a=TestTable,0001355346,1321396132414.df9b549eb594a1f8728608a2a431224a.daughter b=TestTable,0001368082,1321396132414.de861596db4337dc341138f26b9c8bc2.
> ...
> 2011-11-15 22:28:58,052 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_SPLIT, server=sv4r28s44,62023,1321395865619, region=e5be6551c8584a6a1065466e520faf4e
> 2011-11-15 22:28:58,052 WARN org.apache.hadoop.hbase.master.AssignmentManager: Region e5be6551c8584a6a1065466e520faf4e not found on server sv4r28s44,62023,1321395865619; failed processing
> 2011-11-15 22:28:58,052 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for region e5be6551c8584a6a1065466e520faf4e from server sv4r28s44,62023,1321395865619 but it doesn't exist anymore, probably already processed its split
> (repeated forever)
> {quote}
> The master processes the split, but when it calls ZKAssign.deleteNode it doesn't check the boolean that's returned. In this case it was false.
> So for the master the split was completed, but for the region server it's another story:
> {quote}
> 2011-11-15 22:28:57,661 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:62023-0x132f043bbde08d3 Attempting to transition node e5be6551c8584a6a1065466e520faf4e from RS_ZK_REGION_SPLITTING to RS_ZK_REGION_SPLIT
> 2011-11-15 22:28:57,775 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:62023-0x132f043bbde08d3 Successfully transitioned node e5be6551c8584a6a1065466e520faf4e from RS_ZK_REGION_SPLITTING to RS_ZK_REGION_SPLIT
> 2011-11-15 22:28:57,775 INFO org.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting on the master to process the split for e5be6551c8584a6a1065466e520faf4e
> 2011-11-15 22:28:57,876 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:62023-0x132f043bbde08d3 Attempting to transition node e5be6551c8584a6a1065466e520faf4e from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT
> 2011-11-15 22:28:57,967 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:62023-0x132f043bbde08d3 Successfully transitioned node e5be6551c8584a6a1065466e520faf4e from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT
> 2011-11-15 22:28:58,067 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:62023-0x132f043bbde08d3 Attempting to transition node e5be6551c8584a6a1065466e520faf4e from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT
> 2011-11-15 22:28:58,108 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:62023-0x132f043bbde08d3 Successfully transitioned node e5be6551c8584a6a1065466e520faf4e from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT
> (printed forever)
> {quote}
> Since the znode isn't really deleted, the region server thinks the master just hasn't gotten around to processing its region, so it keeps waiting, which leaves the region *unavailable*.
> We need to just retry the delete master-side ASAP since the RS will wait 100ms between retries.
> At the same time, it would be nice if ZKAssign.deleteNode always printed out the name of the region in its messages, because it took me a while to see that the delete didn't take effect while looking at grep output.
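
The master-side behaviour the description asks for (check the boolean returned by the delete, retry right away when it is false, and name the region in every message) could look roughly like the sketch below. This is not the attached patch. The `deleteWithRetry` helper, the `Predicate<String>` standing in for the call to ZKAssign.deleteNode, and the bounded attempt count are all assumptions made for illustration.

```java
import java.util.function.Predicate;

public class SplitZnodeDeleteRetry {

    /**
     * Hypothetical helper: calls deleteOnce until it reports success or
     * maxAttempts is used up. deleteOnce stands in for the real znode delete
     * and is assumed to return false when the delete did not take effect
     * (for example on a version mismatch).
     *
     * Returns the number of attempts used on success, or -1 if every attempt failed.
     */
    static int deleteWithRetry(Predicate<String> deleteOnce, String regionName, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (deleteOnce.test(regionName)) {
                System.out.println("Deleted unassigned node for region " + regionName
                        + " on attempt " + attempt);
                return attempt;
            }
            // Always name the region so a grep on its encoded name shows the failure.
            System.out.println("Delete of unassigned node for region " + regionName
                    + " did not take effect on attempt " + attempt + "; retrying");
        }
        return -1;
    }

    public static void main(String[] args) {
        // Simulated delete that fails twice and then succeeds.
        int[] calls = {0};
        Predicate<String> flakyDelete = region -> ++calls[0] >= 3;

        int used = deleteWithRetry(flakyDelete, "e5be6551c8584a6a1065466e520faf4e", 10);
        System.out.println("attempts used: " + used);
    }
}
```

The retry is immediate, with no sleep, because the region server re-transitions the node every 100ms and each transition can cause another version mismatch. What to do when the attempts run out is left open here.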