Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Date: Tue, 8 Nov 2011 04:42:51 +0000 (UTC)
From: "stack (Commented) (JIRA)"
To: issues@hbase.apache.org
Message-ID: <900186474.9668.1320727371600.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <359965386.38207.1306201913764.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (HBASE-3914) ROOT region appeared in two regionserver's onlineRegions at the same time
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

[
https://issues.apache.org/jira/browse/HBASE-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146076#comment-13146076 ]

stack commented on HBASE-3914:
------------------------------

@mingjian Do you have a log? Do you want to open a new issue?

> ROOT region appeared in two regionserver's onlineRegions at the same time
> -------------------------------------------------------------------------
>
>                 Key: HBASE-3914
>                 URL: https://issues.apache.org/jira/browse/HBASE-3914
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3914-V2.patch, HBASE-3914.patch
>
>
> This can happen, with low probability, through the following steps:
> (Assume the cluster nodes are named RS1/RS2/HM, and there are more than 10,000 regions in the cluster.)
> 1. The ROOT region was opened on RS1.
> 2. For some reason (maybe the HDFS process became abnormal), RS1 aborted.
> 3. ServerShutdownHandler processing started.
> 4. HMaster was restarted. During finishInitialization, the ROOT region was unset and assigned to RS2.
> 5. The ROOT region was opened successfully on RS2.
> 6. After a while, the ROOT region was unset again by RS1's ServerShutdownHandler and then reassigned. RS1 had been restarted before that, so there are two possibilities:
> Case a:
>   The ROOT region was assigned to RS1.
>   Nothing appeared to be affected, but the ROOT region was still online on RS2.
>
> Case b:
>   The ROOT region was assigned to RS2.
>   It could not be opened until it was reassigned to another regionserver, because it was already shown as online on this regionserver.
> This can be seen in the logs:
> 1. The ROOT region was opened twice:
> 2011-05-17 10:32:59,188 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on 162-2-77-0,20020,1305598359031
> 2011-05-17 10:33:01,536 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on 162-2-16-6,20020,1305597548212
> 2. Regionserver 162-2-16-6 aborted, so the ROOT region was reassigned to 162-2-77-0, but it was already online on that server:
> 10:49:30,920 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: -ROOT-,,0.70236052
> 10:49:30,920 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of -ROOT-,,0.70236052
> 10:49:30,920 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Attempted open of -ROOT-,,0.70236052 but already online on this server
> This could leave the ROOT region offline for a long time, although it only happens in a special scenario. I have checked the code, and there seems to be a small bug here.
> There are two call sites of assignRoot():
> 1.
> HMaster#assignRootAndMeta:
>     if (!catalogTracker.verifyRootRegionLocation(timeout)) {
>       this.assignmentManager.assignRoot();
>       this.catalogTracker.waitForRoot();
>       assigned++;
>     }
> 2.
> ServerShutdownHandler#process:
>
>     if (isCarryingRoot()) { // -ROOT-
>       try {
>         this.services.getAssignmentManager().assignRoot();
>       } catch (KeeperException e) {
>         this.server.abort("In server shutdown processing, assigning root", e);
>         throw new IOException("Aborting", e);
>       }
>     }
> I think that each time assignRoot() is called, we should verify the ROOT region's location first, because the ROOT region could already have been assigned from another place before this assignment.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
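
The proposal in the quoted description (verify the ROOT region's location before every assignRoot() call) can be sketched roughly as below. This is only an illustration under stated assumptions, not the attached patch: RootLocationVerifier, RootAssigner, GuardedRootAssignment and FakeRootState are hypothetical names standing in for the CatalogTracker and AssignmentManager calls quoted above, and the real HBase methods have different signatures and exception handling.

```java
/** Hypothetical stand-in for the check done by catalogTracker.verifyRootRegionLocation(timeout). */
interface RootLocationVerifier {
    boolean verifyRootRegionLocation(long timeoutMs);
}

/** Hypothetical stand-in for assignmentManager.assignRoot(). */
interface RootAssigner {
    void assignRoot();
}

/** Routes every ROOT assignment through a location check, so a late caller skips it. */
final class GuardedRootAssignment {
    private final RootLocationVerifier verifier;
    private final RootAssigner assigner;

    GuardedRootAssignment(RootLocationVerifier verifier, RootAssigner assigner) {
        this.verifier = verifier;
        this.assigner = assigner;
    }

    /**
     * Assigns ROOT only if its current location cannot be verified.
     * synchronized serializes callers inside one process (an assumption of this
     * sketch); it does not make the check atomic across a real cluster.
     *
     * @return true if ROOT was assigned, false if it was already online
     */
    synchronized boolean assignRootIfNeeded(long timeoutMs) {
        if (verifier.verifyRootRegionLocation(timeoutMs)) {
            return false; // already assigned from another place, leave it alone
        }
        assigner.assignRoot();
        return true;
    }
}

/** Toy in-memory state, used only to exercise the guard. */
final class FakeRootState implements RootLocationVerifier, RootAssigner {
    private boolean rootOnline = false;
    private int assignCount = 0;

    @Override
    public boolean verifyRootRegionLocation(long timeoutMs) {
        return rootOnline;
    }

    @Override
    public void assignRoot() {
        rootOnline = true;
        assignCount++;
    }

    int assignCount() {
        return assignCount;
    }
}
```

With both call sites (the master initialization path and ServerShutdownHandler) going through assignRootIfNeeded, the second caller in the scenario above would find ROOT already online and skip the reassignment instead of unsetting it again.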