hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jeffrey Zhong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-12398) Region isn't assigned in an extreme race condition
Date Fri, 31 Oct 2014 23:53:35 GMT

    [ https://issues.apache.org/jira/browse/HBASE-12398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192770#comment-14192770

Jeffrey Zhong commented on HBASE-12398:

Yes, this is only 0.98 & branch-1. Trunk code change significantly and seems may not happen.

> Region isn't assigned in an extreme race condition
> --------------------------------------------------
>                 Key: HBASE-12398
>                 URL: https://issues.apache.org/jira/browse/HBASE-12398
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 0.98.7
>            Reporter: Jeffrey Zhong
>            Assignee: Jeffrey Zhong
>         Attachments: HBASE-12398.patch
> In a test, [~enis] has seen a condition which made one of the regions unassigned. 
> The client failed since the region is not online anywhere: 
> {code}
> 2014-10-29 01:51:40,731 WARN  [HBaseReaderThread_13] util.MultiThreadedReader: org.apache.hadoop.hbase.client.RetriesExhaustedException:
Failed after attempts=35, exceptions:
> Wed Oct 29 01:39:51 UTC 2014, org.apache.hadoop.hbase.client.RpcRetryingCaller@cc21330,
org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException:
Region IntegrationTestRegionReplicaReplication,06666666,1414545619766_0001.689b77e1bad7e951b0d9ef4663b217e9.
is not online on hor8n08.gq1.ygridcore.net,60020,1414546670414
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2774)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4257)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2906)
>         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29990)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2078)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94)
>         at java.lang.Thread.run(Thread.java:722)
> {code}
> The root cause of the issue is due to some extreme race condition:
> a) a region is about to open and receives a closeRpc request triggered by a second re-assignment
> b) the second re-assignment updates region state to offline while immediately is overwritten
to OPEN from previous region open ZK opened notification
> c) when the region reopened on the same RS by the second assignment, AM force the region
to close as the its region state isn't in PendingOpenOrOpening state.  
> d) the region ends up offline & can't server any request
> Region Server Side:
> 1) A region almost opens region 689b77e1bad7e951b0d9ef4663b217e9 while the RS(hor8n10)
receives a closeRegion request.
> {noformat}
> 2014-10-29 01:39:43,153 INFO  [PriorityRpcServer.handler=2,queue=0,port=60020] regionserver.HRegionServer:
Received CLOSE for the region:689b77e1bad7e951b0d9ef4663b217e9 , which we are already trying
to OPEN. Cancelling OPENING.
> {noformat}
> 2) Since region 689b77e1bad7e951b0d9ef4663b217e9 was already opened right before some
final steps, so the RS logs the following message and close 689b77e1bad7e951b0d9ef4663b217e9
immediately after the RS update ZK node state to 'OPENED'.
> {noformat}
> 2014-10-29 01:39:43,198 ERROR [RS_OPEN_REGION-hor8n10:60020-0] handler.OpenRegionHandler:
Race condition: we've finished to open a region, while a close was requested  on region=IntegrationTestRegionReplicaReplication,06666666,1414545619766_0001.689b77e1bad7e951b0d9ef4663b217e9..
It can be a critical error, as a region that should be closed is now opened. Closing it now
> {noformat}
> In Master Server Side:
> {noformat}
> 2014-10-29 01:39:43,177 DEBUG [AM.ZK.Worker-pool2-t55] master.AssignmentManager: Handling
RS_ZK_REGION_OPENED, server=hor8n10.gq1.ygridcore.net,60020,1414546531945, region=689b77e1bad7e951b0d9ef4663b217e9,
current_state={689b77e1bad7e951b0d9ef4663b217e9 state=OPENING, ts=1414546783152, server=hor8n10.gq1.ygridcore.net,60020,1414546531945}
> ....
> 2014-10-29 01:39:43,255 DEBUG [AM.-pool1-t16] master.AssignmentManager: Offline IntegrationTestRegionReplicaReplication,06666666,1414545619766_0001.689b77e1bad7e951b0d9ef4663b217e9.,
it's not any more on hor8n10.gq1.ygridcore.net,60020,1414546531945
> ....
> 2014-10-29 01:39:43,942 DEBUG [AM.ZK.Worker-pool2-t58] master.AssignmentManager: Handling
RS_ZK_REGION_OPENED, server=hor8n10.gq1.ygridcore.net,60020,1414546531945, region=689b77e1bad7e951b0d9ef4663b217e9,
current_state={689b77e1bad7e951b0d9ef4663b217e9 state=OPEN, ts=1414546783387, server=hor8n10.gq1.ygridcore.net,60020,1414546531945}
> 2014-10-29 01:39:43,942 WARN  [AM.ZK.Worker-pool2-t58] master.AssignmentManager: Received
OPENED for 689b77e1bad7e951b0d9ef4663b217e9 from hor8n10.gq1.ygridcore.net,60020,1414546531945
but the region isn't PENDING_OPEN/OPENING here: {689b77e1bad7e951b0d9ef4663b217e9 state=OPEN,
ts=1414546783387, server=hor8n10.gq1.ygridcore.net,60020,1414546531945}
> 2014-10-29 01:39:43,948 DEBUG [AM.ZK.Worker-pool2-t58] master.AssignmentManager: Sent
CLOSE to hor8n10.gq1.ygridcore.net,60020,1414546531945 for region IntegrationTestRegionReplicaReplication,06666666,1414545619766_0001.689b77e1bad7e951b0d9ef4663b217e9.
> {noformat}

This message was sent by Atlassian JIRA

View raw message