hbase-issues mailing list archives

From "Jeffrey Zhong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-9387) Region could get lost during assignment
Date Fri, 30 Aug 2013 00:23:51 GMT

    [ https://issues.apache.org/jira/browse/HBASE-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13754245#comment-13754245 ]

Jeffrey Zhong commented on HBASE-9387:

After checking the logs, the root cause is that the META region got lost: the Master thinks it is
assigned, while the region server that was hosting META has closed it. In the end, no region
server is hosting META.

As you will see in the following logs, the RS successfully opened the META region but had trouble
updating the ZK node to OPENED, so it retried. While one of the retries was in flight, the Master
deleted the RIT znode. The RS therefore failed to set the ZK state to OPENED, because the Master
had already deleted the RIT znode, and the RS then closed the META region. Later it also failed to
set the RIT state to FAILED_OPEN to notify the Master, again because the RIT znode was gone.

Since the issue could happen for any region assignment, I think we can fix it as follows:

Check META to see if the current RS owns the region. If the current RS owns it, just proceed with
the post-open process; otherwise, follow today's logic. If we can't get info on this region from
META, we should abort the current RS.
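The proposed decision could be sketched roughly as below. This is only an illustration of the three-way outcome, assuming a hypothetical META-lookup result; the enum and method names are placeholders, not real HBase APIs:

```java
// Hypothetical sketch of the proposed post-open check after a failed
// OPENING -> OPENED ZK transition. Names are illustrative, not HBase APIs.
public class PostOpenCheck {

    // Possible outcomes of looking this region up in META.
    enum MetaLookup { OWNED_BY_US, OWNED_BY_OTHER, UNAVAILABLE }

    // What the region server should do next.
    enum Action { PROCEED_POST_OPEN, DO_CURRENT_LOGIC, ABORT_RS }

    static Action decide(MetaLookup lookup) {
        switch (lookup) {
            case OWNED_BY_US:
                // ZK transition failed, but META says we own the region:
                // finish the post-open process instead of closing it.
                return Action.PROCEED_POST_OPEN;
            case OWNED_BY_OTHER:
                // Someone else owns it: fall back to today's behavior.
                return Action.DO_CURRENT_LOGIC;
            default:
                // Can't read META at all: abort this RS to stay safe.
                return Action.ABORT_RS;
        }
    }
}
```

The key point is the third branch: without a readable META there is no way to tell whether this RS still owns the region, so aborting is the only safe choice.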

Below are related logs:

2013-08-29 22:15:29,978 DEBUG [RS_OPEN_META-kiyo:57016-0] zookeeper.ZKAssign(786): regionserver:57016-0x140cc24e86a0003
Transitioning 1588230740 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED
2013-08-29 22:15:32,746 WARN  [RS_OPEN_META-kiyo:57016-0] zookeeper.RecoverableZooKeeper(258):
Possibly transient ZooKeeper, quorum=localhost:61251, exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase/region-in-transition/1588230740
2013-08-29 22:15:32,746 INFO  [RS_OPEN_META-kiyo:57016-0] util.RetryCounter(54): Sleeping
2000ms before retry #1...
2013-08-29 22:15:34,148 DEBUG [AM.ZK.Worker-pool2-t4] master.AssignmentManager(801): Handling
transition=RS_ZK_REGION_OPENED, server=kiyo.gq1.ygridcore.net,57016,1377814510039, region=1588230740,
current state from region state map ={1588230740 state=OPENING, ts=1377814529578, server=kiyo.gq1.ygridcore.net,57016,1377814510039}
2013-08-29 22:15:34,149 DEBUG [AM.ZK.Worker-pool2-t4] zookeeper.ZKAssign(405): master:44745-0x140cc24e86a0001-0x140cc24e86a0001
Deleting existing unassigned node 1588230740 in expected state RS_ZK_REGION_OPENED
2013-08-29 22:15:34,179 DEBUG [pool-1-thread-1-EventThread] zookeeper.ZooKeeperWatcher(312):
master:44745-0x140cc24e86a0001-0x140cc24e86a0001 Received ZooKeeper Event, type=NodeDeleted,
state=SyncConnected, path=/hbase/region-in-transition/1588230740
2013-08-29 22:15:34,760 WARN  [RS_OPEN_META-kiyo:57016-0] zookeeper.ZKAssign(866): regionserver:57016-0x140cc24e86a0003-0x140cc24e86a0003
Attempt to transition the unassigned node for 1588230740 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED
failed, the node existed and was in the expected state but then when setting data it no longer
2013-08-29 22:15:34,761 WARN  [RS_OPEN_META-kiyo:57016-0] handler.OpenRegionHandler(370):
Completed the OPEN of region hbase:meta,,1.1588230740 but when transitioning from  OPENING
to OPENED got a version mismatch, someone else clashed so now unassigning -- closing region
on server: kiyo.gq1.ygridcore.net,57016,1377814510039
2013-08-29 22:15:34,761 DEBUG [RS_OPEN_META-kiyo:57016-0] regionserver.HRegion(950): Closing
hbase:meta,,1.1588230740: disabling compactions & flushes
2013-08-29 22:15:34,761 DEBUG [RS_OPEN_META-kiyo:57016-0] regionserver.HRegion(972): Updates
disabled for region hbase:meta,,1.1588230740
2013-08-29 22:15:34,762 INFO  [StoreCloserThread-hbase:meta,,1.1588230740-1] regionserver.HStore(668):
Closed info
2013-08-29 22:15:34,762 INFO  [RS_OPEN_META-kiyo:57016-0] regionserver.HRegion(1030): Closed
2013-08-29 22:15:34,763 INFO  [RS_OPEN_META-kiyo:57016-0] handler.OpenRegionHandler(396):
Opening of region {ENCODED => 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '',
ENDKEY => ''} failed, transitioning from OPENING to FAILED_OPEN in ZK, expecting version
2013-08-29 22:15:34,763 DEBUG [RS_OPEN_META-kiyo:57016-0] zookeeper.ZKAssign(786): regionserver:57016-0x140cc24e86a0003-0x140cc24e86a0003
Transitioning 1588230740 from RS_ZK_REGION_OPENING to RS_ZK_REGION_FAILED_OPEN
2013-08-29 22:15:34,765 DEBUG [RS_OPEN_META-kiyo:57016-0] zookeeper.ZKUtil(784): regionserver:57016-0x140cc24e86a0003-0x140cc24e86a0003
Unable to get data of znode /hbase/region-in-transition/1588230740 because node does not exist
(not necessarily an error)
2013-08-29 22:15:34,765 WARN  [RS_OPEN_META-kiyo:57016-0] handler.OpenRegionHandler(404):
Unable to mark region {ENCODED => 1588230740, NAME => 'hbase:meta,,1', STARTKEY =>
'', ENDKEY => ''} as FAILED_OPEN. It's likely that the master already timed out this open
attempt, and thus another RS already has the region.

> Region could get lost during assignment
> ---------------------------------------
>                 Key: HBASE-9387
>                 URL: https://issues.apache.org/jira/browse/HBASE-9387
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 0.95.2
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>         Attachments: 9387-v1.txt, org.apache.hadoop.hbase.TestFullLogReconstruction-output.txt
> I observed a test timeout running against hadoop 2.1.0 with distributed log replay turned on.
> Looks like region state for 1588230740 became inconsistent between master and the surviving
region server:
> {code}
> 2013-08-29 22:15:34,180 INFO  [AM.ZK.Worker-pool2-t4] master.RegionStates(299): Onlined
1588230740 on kiyo.gq1.ygridcore.net,57016,1377814510039
> ...
> 2013-08-29 22:15:34,587 DEBUG [Thread-221] client.HConnectionManager$HConnectionImplementation(1269):
locateRegionInMeta parentTable=hbase:meta, metaLocation={region=hbase:meta,,1.1588230740,
hostname=kiyo.gq1.ygridcore.net,57016,1377814510039, seqNum=0}, attempt=2 of 35 failed; retrying
after sleep of 302 because: org.apache.hadoop.hbase.exceptions.RegionOpeningException: Region
is being opened: 1588230740
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2574)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3949)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2733)
>         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:26965)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2063)
>         at org.apache.hadoop.hbase.ipc.RpcServer$CallRunner.run(RpcServer.java:1800)
>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:165)
>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:41)
> {code}
