hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-3914) ROOT region appeared in two regionserver's onlineRegions at the same time
Date Wed, 25 May 2011 04:51:47 GMT

    [ https://issues.apache.org/jira/browse/HBASE-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038956#comment-13038956
] 

stack commented on HBASE-3914:
------------------------------

bq. In future, I will notice what you remind me, and I also hope I can contribute more patches.

No problem and keep the patches coming!

> ROOT region appeared in two regionserver's onlineRegions at the same time
> -------------------------------------------------------------------------
>
>                 Key: HBASE-3914
>                 URL: https://issues.apache.org/jira/browse/HBASE-3914
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3914-V2.patch, HBASE-3914.patch
>
>
> This could be happen under the following steps with little probability:
> (I suppose the cluster nodes names are RS1/RS2/HM, and there's more than 10,000 regions
in the cluster)
> 1.Root region was opened in RS1.
> 2.Due to some reason(Maybe the hdfs process was got abnormal),RS1 aborted.
> 3.ServerShutdownHandler process start.
> 4.HMaster was restarted, during the finishInitialization's handling, ROOT region was
unsetted, and assigned to RS2. 
> 5.Root region was opened successfully in RS2.
> 6.But after while, ROOT region was unsetted again by RS1's ServerShutdownHandler. Then
it was reassigned. Before that, the RS1 was restarted. So there's two possibilities:
>  Case a:
>    ROOT region was assigned to RS1. 
>    It seemed nothing would be affected. But the root region was still online in RS2.
 
>    
>  Case b:
>    ROOT region was assigned to RS2.    
>    The ROOT Region couldn't be opened until it would be reassigned to other regionserver,
because it was showed online in this regionserver.
> This could be proved from the logs:
> 1. ROOT region was opened with two times:
> 2011-05-17 10:32:59,188 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler:
Opened region -ROOT-,,0.70236052 on 162-2-77-0,20020,1305598359031
> 2011-05-17 10:33:01,536 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler:
Opened region -ROOT-,,0.70236052 on 162-2-16-6,20020,1305597548212
> 2.Regionserver 162-2-16-6 was aborted, so it was reassigned to 162-2-77-0, but already
online on this server:
> 10:49:30,920 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request
to open region: -ROOT-,,0.70236052 10:49:30,920 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
Processing open of -ROOT-,,0.70236052 10:49:30,920 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
Attempted open of -ROOT-,,0.70236052 but already online on this server
> This could be cause a long break of ROOT region offline, though it happened under a special
scenario. And I have checked the code, it seems a tiny bug here.
> There's 2 references about assignRoot():
> 1.
> HMaster# assignRootAndMeta:
>     if (!catalogTracker.verifyRootRegionLocation(timeout)) {
>       this.assignmentManager.assignRoot();
>       this.catalogTracker.waitForRoot();
>       assigned++;
>     }
> 2.
> ServerShutdownHandler# process: 
>     
>       if (isCarryingRoot()) { // -ROOT-      
>         try {        
>            this.services.getAssignmentManager().assignRoot();
>         } catch (KeeperException e) {
>            this.server.abort("In server shutdown processing, assigning root", e);
>            throw new IOException("Aborting", e);
>         }
>       }    
> I think each time call the method of assignRoot(), we should verify Root Region's Location
first. Because before the assigning, the ROOT region could have been assigned by another place.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message