hbase-dev mailing list archives

From "Jeffrey Zhong (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HBASE-9665) Region gets lost when balancer & SSH both trying to assign
Date Thu, 26 Sep 2013 19:41:03 GMT
Jeffrey Zhong created HBASE-9665:
------------------------------------

             Summary: Region gets lost when balancer & SSH both trying to assign 
                 Key: HBASE-9665
                 URL: https://issues.apache.org/jira/browse/HBASE-9665
             Project: HBase
          Issue Type: Bug
          Components: Region Assignment
    Affects Versions: 0.96.0
            Reporter: Jeffrey Zhong
            Priority: Critical


In summary, a server dies and its regions are re-assigned by SSH (ServerShutdownHandler).
Right before SSH runs, however, the balancer starts moving one region on that server to another server.

The balancer assignment got preempted by the SSH assignment:
{code}
2013-09-25 11:55:32,854 INFO Priority.RpcServer.handler=7,port=60020 regionserver.HRegionServer:
Received CLOSE for the region:6deb1bfefe8cbdb443084efe919fdeb7 , which we are already trying
to OPEN. Cancelling OPENING.
{code}

The SSH assignment (by GeneralBulkAssigner) failed too, due to:
{code}
2013-09-25 11:55:32,927 WARN  [RS_OPEN_REGION-hor15n09:60020-2] zookeeper.ZKAssign: regionserver:60020-0x14153d449d30ad0
Attempt to transition the unassigned node for 6deb1bfefe8cbdb443084efe919fdeb7 from M_ZK_REGION_OFFLINE
to RS_ZK_REGION_OPENING failed, the server that tried to transition was hor15n09.gq1.ygridcore.net,60020,1380109280320
not the expected hor15n07.gq1.ygridcore.net,60020,1380109890414
{code}

In the end, the region 6deb1bfefe8cbdb443084efe919fdeb7 is lost.


Below is the master log; you can see that both the balancer and SSH try to assign the region
at around the same time:

{code}
2013-09-25 11:55:32,731 INFO  [MASTER_SERVER_OPERATIONS-hor15n05:60000-4] master.RegionStates:
Transitioning {6deb1bfefe8cbdb443084efe919fdeb7 state=PENDING_CLOSE, ts=1380110132710, server=hor15n12.gq1.ygridcore.net,60020,1380109596307}
will be handled by SSH for hor15n12.gq1.ygridcore.net,60020,1380109596307

...

2013-09-25 11:55:32,849 INFO  [hor15n05.gq1.ygridcore.net,60000,1380108611483-BalancerChore]
master.RegionStates: Transitioned {6deb1bfefe8cbdb443084efe919fdeb7 state=OFFLINE, ts=1380110132768,
server=null} to {6deb1bfefe8cbdb443084efe919fdeb7 state=PENDING_OPEN, ts=1380110132849, server=hor15n07.gq1.ygridcore.net,60020,1380109890414}

...

2013-09-25 11:55:32,898 INFO  [hor15n05.gq1.ygridcore.net,60000,1380108611483-GeneralBulkAssigner-1]
master.RegionStates: Transitioned {6deb1bfefe8cbdb443084efe919fdeb7 state=OFFLINE, ts=1380110132861,
server=null} to {6deb1bfefe8cbdb443084efe919fdeb7 state=PENDING_OPEN, ts=1380110132898, server=hor15n09.gq1.ygridcore.net,60020,1380109280320}
{code}

Because SSH forces the region assignment but does not recreate the offline znode, the later region
opening fails with the following error. I suggest recreating the offline znode whenever
we force a region assignment (forceNewPlan=true); this should be a low-impact change.

{code}
2013-09-25 11:55:32,927 WARN  [RS_OPEN_REGION-hor15n09:60020-2] zookeeper.ZKAssign: regionserver:60020-0x14153d449d30ad0
Attempt to transition the unassigned node for 6deb1bfefe8cbdb443084efe919fdeb7 from M_ZK_REGION_OFFLINE
to RS_ZK_REGION_OPENING failed, the server that tried to transition was hor15n09.gq1.ygridcore.net,60020,1380109280320
not the expected hor15n07.gq1.ygridcore.net,60020,1380109890414
{code}
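
The race and the suggested fix can be sketched with a toy in-memory model. This is only an illustration under stated assumptions, not HBase code: the class `UnassignedZnodeSketch`, its methods, and the `recreateOnForce` flag are hypothetical names, and a plain map stands in for the ZooKeeper unassigned znodes. The only behaviour taken from the logs above is that the OFFLINE to OPENING transition is rejected when the znode names a different server than the one attempting it.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for the unassigned znodes in ZooKeeper; hypothetical, not HBase code. */
class UnassignedZnodeSketch {
    // Region encoded name -> server recorded in that region's OFFLINE znode.
    private final Map<String, String> offlineZnodeOwner = new HashMap<>();

    /** Master side: assign a region; recreateZnode models the proposed fix. */
    void masterAssign(String region, String targetServer, boolean recreateZnode) {
        if (recreateZnode || !offlineZnodeOwner.containsKey(region)) {
            offlineZnodeOwner.put(region, targetServer);
        }
        // Otherwise the existing znode is reused and still names the earlier target.
    }

    /** Region server side: OFFLINE -> OPENING succeeds only if the znode names this server. */
    boolean transitionToOpening(String region, String server) {
        return server.equals(offlineZnodeOwner.get(region));
    }

    /** Replays the reported sequence; returns true if the region ends up opening. */
    static boolean replay(boolean recreateOnForce) {
        UnassignedZnodeSketch zk = new UnassignedZnodeSketch();
        String region = "6deb1bfefe8cbdb443084efe919fdeb7";

        // 1. The balancer assigns the region to hor15n07 and creates the OFFLINE znode.
        zk.masterAssign(region, "hor15n07", true);

        // 2. SSH forces a new plan to hor15n09 (forceNewPlan=true).
        zk.masterAssign(region, "hor15n09", recreateOnForce);

        // 3. The balancer's open is cancelled by the CLOSE, so only the target
        //    chosen by SSH (hor15n09) still tries to open the region.
        return zk.transitionToOpening(region, "hor15n09");
    }

    public static void main(String[] args) {
        System.out.println("without recreating the znode: opened=" + replay(false));
        System.out.println("with the znode recreated:     opened=" + replay(true));
    }
}
```

In this model the region is left unopened when the forced assignment reuses the stale znode, and it opens once the znode is recreated for the new target.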



