hbase-issues mailing list archives

From "ramkrishna.s.vasudevan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-8545) Meta stuck in transition when it is assigned to a just restarted dead region sever
Date Fri, 17 May 2013 04:43:15 GMT

    [ https://issues.apache.org/jira/browse/HBASE-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13660327#comment-13660327 ]

ramkrishna.s.vasudevan commented on HBASE-8545:
-----------------------------------------------

@Jimmy
Going through the logs:
{code}
[main] master.ServerManager(736): New admin connection to RSVASUDE-MOBL.gar.corp.intel.com,52745,1368765576099
2013-05-17 10:09:54,939 INFO  [PRI IPC Server handler 0 on 52745] regionserver.HRegionServer(3456): Received request to open region: testAssignRegionOnRestartedServer,A,1368765580368.4c096a61387cfc7d63df0be881875f03. on RSVASUDE-MOBL.gar.corp.intel.com,52745,1368765576199
{code}
The server whose start code ends in 099 is the dead server, and the one ending in 199 is the
new, live server.  So should we validate this on the RS side and mark the open as FAILED_OPEN so
that the master can retry the assignment with a new plan?  This is the case where an old server
goes down and a new server immediately comes up on the same node.
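
A rough sketch of the kind of RS-side check I mean, in standalone Java rather than against the real HBase classes. ServerId, OpenRegionCheck and validate are hypothetical names, and the sketch assumes the open request carries the server name the master intended to reach, which I have not verified for the current RPC:

```java
import java.util.Objects;

/** Hypothetical stand-in for a server identity printed as host,port,startcode. */
final class ServerId {
    final String host;
    final int port;
    final long startCode;

    ServerId(String host, int port, long startCode) {
        this.host = host;
        this.port = port;
        this.startCode = startCode;
    }

    /** Parses the "host,port,startcode" form seen in the log lines above. */
    static ServerId parse(String s) {
        String[] parts = s.split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected host,port,startcode: " + s);
        }
        return new ServerId(parts[0], Integer.parseInt(parts[1]), Long.parseLong(parts[2]));
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ServerId)) {
            return false;
        }
        ServerId other = (ServerId) o;
        // The start code is part of the identity: same host and port is not enough.
        return host.equals(other.host) && port == other.port && startCode == other.startCode;
    }

    @Override
    public int hashCode() {
        return Objects.hash(host, port, startCode);
    }
}

final class OpenRegionCheck {
    enum Result { PROCEED, FAILED_OPEN }

    /**
     * Assumption: the request names the server the master meant to reach.
     * A request meant for a previous instance on the same host and port is rejected.
     */
    static Result validate(ServerId self, ServerId intendedTarget) {
        return self.equals(intendedTarget) ? Result.PROCEED : Result.FAILED_OPEN;
    }

    public static void main(String[] args) {
        ServerId dead = ServerId.parse("RSVASUDE-MOBL.gar.corp.intel.com,52745,1368765576099");
        ServerId alive = ServerId.parse("RSVASUDE-MOBL.gar.corp.intel.com,52745,1368765576199");
        System.out.println(validate(alive, dead));  // request was meant for the dead instance
        System.out.println(validate(alive, alive)); // request was meant for this instance
    }
}
```

With the two server names from the log, the first call yields FAILED_OPEN and the second PROCEED, so the master would get a chance to pick a new plan instead of waiting on the stale one.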
                
> Meta stuck in transition when it is assigned to a just restarted dead region sever 
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-8545
>                 URL: https://issues.apache.org/jira/browse/HBASE-8545
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: trunk-8545.patch, trunk-8545_v2.patch
>
>
> Suppose the meta region server is down, and the SSH tries to re-assign it.  This could happen:
> 1. AM plans to assign meta to a region server (R_old);
> 2. Now R_old is dead, and a new region server (R_new) starts up on the same host and port, but gets a different start code;
> 3. AM sends the open region request to R_new and the meta is opened on it;
> 4. AM gets the ZK event, but it is from a different region server instance (R_new), not the expected one (R_old), so it sends a close region request to R_new;
> 5. Now the meta is stuck in transition and won't be assigned.
> This won't happen to a user region, since the SSH for R_old will find the user region stuck in transition and re-assign it.  For meta, it is a little different: AM checks whether a dead region server carries the meta based on the ZK info, which is changed to the new region server R_new at step 3 by the open region handler.
> The fix I was thinking about is:
> 1. When checking if a region server carries a region, use the region transition information if it exists (which is the source of truth to the master); if not, check the ZK data as before;
> 2. In the open region handler, when transitioning the assign ZK node from offline to opening, make sure the current region server is the expected one (ZK#transitionNode; the existing code doesn't check the target server name).
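
To make point 1 of the proposed fix concrete, here is a minimal standalone sketch, not the actual AssignmentManager code. MetaCarrierCheck and its maps are hypothetical stand-ins for the master's in-memory transition state and the location stored in ZK, and server names are plain host,port,startcode strings:

```java
import java.util.HashMap;
import java.util.Map;

final class MetaCarrierCheck {
    // Hypothetical master-side state: region -> server the master planned it for.
    private final Map<String, String> regionsInTransition = new HashMap<>();
    // Hypothetical stand-in for the region location recorded in ZK.
    private final Map<String, String> zkLocation = new HashMap<>();

    void planAssignment(String region, String server) {
        regionsInTransition.put(region, server);
    }

    void recordZkLocation(String region, String server) {
        zkLocation.put(region, server);
    }

    void clearTransition(String region) {
        regionsInTransition.remove(region);
    }

    /** Prefers the transition info when present, else falls back to the ZK data. */
    boolean carries(String server, String region) {
        String planned = regionsInTransition.get(region);
        if (planned != null) {
            return planned.equals(server);
        }
        return server.equals(zkLocation.get(region));
    }

    public static void main(String[] args) {
        String rOld = "host,60020,1"; // dead instance
        String rNew = "host,60020,2"; // restarted on the same host and port
        MetaCarrierCheck check = new MetaCarrierCheck();
        check.planAssignment("meta", rOld);   // step 1 of the scenario
        check.recordZkLocation("meta", rNew); // step 3: open handler rewrites ZK
        // With the transition info consulted first, the shutdown handling
        // for R_old still sees meta as its responsibility.
        System.out.println(check.carries(rOld, "meta"));
    }
}
```

In this sketch a ZK-only check would report that R_old does not carry the meta, which matches the stuck state described in the issue; consulting the transition state first keeps R_old as the answer until the transition is cleared.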

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
