hbase-issues mailing list archives

From "ramkrishna.s.vasudevan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-7799) reassigning region stuck in open still may not work correctly due to leftover ZK node
Date Tue, 12 Feb 2013 18:07:13 GMT

    [ https://issues.apache.org/jira/browse/HBASE-7799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13576835#comment-13576835 ]

ramkrishna.s.vasudevan commented on HBASE-7799:
-----------------------------------------------

Found the problem; it is a straightforward scenario.
A region is in transition and is being moved to RS1.  RS1 gets killed in between, and the state
in ZK is left as RS_ZK_REGION_OPENING.
Now SSH (ServerShutdownHandler) kicks in.  Jimmy's HBASE-7701 intelligently picks this region so
that SSH can start assigning it.

The RIT state is forcefully set to CLOSED, and then the GeneralBulkAssigner kicks in.
Here we try to create a znode with OFFLINE state, but if the node already exists we silently return.
 
In AM.asyncSetOfflineInZooKeeper()
{code}
    try {
      ZKAssign.asyncCreateNodeOffline(watcher, state.getRegion(),
        destination, cb, state);
    } catch (KeeperException e) {
      if (e instanceof NodeExistsException) {
        // The existing node is left as-is; its old state is not reset to OFFLINE.
        LOG.warn("Node for " + state.getRegion() + " already exists");
      } else {
        server.abort("Unexpected ZK exception creating/setting node OFFLINE", e);
      }
      return false;
    }
{code}

Now when the new RS2 tries to transition the znode, expecting it to be in the M_ZK_REGION_OFFLINE
state, the transition fails because the node is still in RS_ZK_REGION_OPENING. This leads to an
infinite loop.
I will come up with a patch later, as it is late here.
Please correct me if I am wrong here.
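
To make the sequence above concrete, here is a minimal toy model of it. It is only a sketch under stated assumptions: the class `StaleZnodeDemo` and its methods are hypothetical, an in-memory map stands in for ZooKeeper, and none of this is HBase code. Only the state names and the region's encoded name are taken from the logs quoted below.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the unassigned-znode handshake. Hypothetical; not HBase code. */
class StaleZnodeDemo {
    // State names as they appear in the quoted logs.
    enum ZkState { M_ZK_REGION_OFFLINE, RS_ZK_REGION_OPENING }

    // Stand-in for /hbase/region-in-transition: encoded region name -> node state.
    private final Map<String, ZkState> znodes = new HashMap<>();

    /** Master side: create the node as OFFLINE, but leave an existing node untouched. */
    boolean createNodeOffline(String region) {
        return znodes.putIfAbsent(region, ZkState.M_ZK_REGION_OFFLINE) == null;
    }

    /** Region server side: move the node to 'next' only if it is currently in 'expected'. */
    boolean transition(String region, ZkState expected, ZkState next) {
        if (znodes.get(region) != expected) {
            return false;
        }
        znodes.put(region, next);
        return true;
    }

    public static void main(String[] args) {
        StaleZnodeDemo zk = new StaleZnodeDemo();
        String region = "871d1c3bdf98a2c93b527cb6cc61327d";

        // RS1 moves the node to OPENING and is then killed; the node stays behind.
        zk.createNodeOffline(region);
        zk.transition(region, ZkState.M_ZK_REGION_OFFLINE, ZkState.RS_ZK_REGION_OPENING);

        // Every reassignment attempt now fails in the same way.
        for (int attempt = 1; attempt <= 3; attempt++) {
            boolean created = zk.createNodeOffline(region);
            boolean opened = zk.transition(region,
                ZkState.M_ZK_REGION_OFFLINE, ZkState.RS_ZK_REGION_OPENING);
            System.out.println("attempt " + attempt
                + ": created=" + created + ", opened=" + opened);
        }
    }
}
```

In this model every attempt reports `created=false, opened=false`, because nothing ever resets the leftover node to the state RS2 expects.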

                
> reassigning region stuck in open still may not work correctly due to leftover ZK node
> -------------------------------------------------------------------------------------
>
>                 Key: HBASE-7799
>                 URL: https://issues.apache.org/jira/browse/HBASE-7799
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>         Attachments: org.apache.hadoop.hbase.IntegrationTestRebalanceAndKillServersTargeted-output.txt.gz
>
>
> (Logs grepped by region name, and abridged.)
> META server was dead so OpenRegionHandler for the region took a while, and was interrupted:
> {code}
> 2013-02-08 14:35:01,555 DEBUG [RS_OPEN_REGION-10.11.2.92,64485,1360362800564-2] handler.OpenRegionHandler(255): Interrupting thread Thread[PostOpenDeployTasks:871d1c3bdf98a2c93b527cb6cc61327d,5,main]
> {code}
> Then master tried to force region offline and reassign:
> {code}
> 2013-02-08 14:35:06,500 INFO  [MASTER_SERVER_OPERATIONS-10.11.2.92,64483,1360362800340-1] master.RegionStates(347): Found opening region {IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d. state=OPENING, ts=1360362901596, server=10.11.2.92,64485,1360362800564} to be reassigned by SSH for 10.11.2.92,64485,1360362800564
> 2013-02-08 14:35:06,500 INFO  [MASTER_SERVER_OPERATIONS-10.11.2.92,64483,1360362800340-1] master.RegionStates(242): Region {NAME => 'IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d.', STARTKEY => '7333332c', ENDKEY => '7ffffff8', ENCODED => 871d1c3bdf98a2c93b527cb6cc61327d,} transitioned from {IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d. state=OPENING, ts=1360362901596, server=10.11.2.92,64485,1360362800564} to {IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d. state=CLOSED, ts=1360362906500, server=null}
> 2013-02-08 14:35:06,505 DEBUG [10.11.2.92,64483,1360362800340-GeneralBulkAssigner-1] master.AssignmentManager(1530): Forcing OFFLINE; was={IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d. state=CLOSED, ts=1360362906500, server=null}
> 2013-02-08 14:35:06,506 DEBUG [10.11.2.92,64483,1360362800340-GeneralBulkAssigner-1] zookeeper.ZKAssign(176): master:64483-0x13cbbf1025d0000 Async create of unassigned node for 871d1c3bdf98a2c93b527cb6cc61327d with OFFLINE state
> {code}
> But didn't delete the original ZK node?
> {code}
> 2013-02-08 14:35:06,509 WARN  [main-EventThread] master.OfflineCallback(59): Node for /hbase/region-in-transition/871d1c3bdf98a2c93b527cb6cc61327d already exists
> 2013-02-08 14:35:06,509 DEBUG [main-EventThread] master.OfflineCallback(69): rs={IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d. state=OFFLINE, ts=1360362906506, server=null}, server=10.11.2.92,64488,1360362800651
> 2013-02-08 14:35:06,512 DEBUG [main-EventThread] master.OfflineCallback$ExistCallback(106): rs={IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d. state=OFFLINE, ts=1360362906506, server=null}, server=10.11.2.92,64488,1360362800651
> {code}
> So it went into an infinite cycle of failing to assign due to this:
> {code}
> 2013-02-08 14:35:06,517 INFO  [PRI IPC Server handler 7 on 64488] regionserver.HRegionServer(3435): Received request to open region: IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d. on 10.11.2.92,64488,1360362800651
> 2013-02-08 14:35:06,521 WARN  [RS_OPEN_REGION-10.11.2.92,64488,1360362800651-0] zookeeper.ZKAssign(762): regionserver:64488-0x13cbbf1025d0004 Attempt to transition the unassigned node for 871d1c3bdf98a2c93b527cb6cc61327d from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed but was in the state RS_ZK_REGION_OPENING set by the server [wrong server name redacted, see HBASE-7798]
> {code}
> Transitioning failed-to-open similarly fails.
> It seems like the master needs to nuke the ZK node unconditionally to offline?
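
The "nuke unconditionally" idea in the description can be illustrated with a toy model. This is only a sketch under stated assumptions, not the eventual HBase patch: the class `ForceOfflineSketch` and all of its methods are hypothetical, and an in-memory map stands in for ZooKeeper. It also ignores the race with a region server that is still alive and updating the node, which a real fix would have to consider.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of forcing a leftover znode back to OFFLINE. Hypothetical; not HBase code. */
class ForceOfflineSketch {
    // State names as they appear in the logs above.
    enum ZkState { M_ZK_REGION_OFFLINE, RS_ZK_REGION_OPENING }

    // Stand-in for /hbase/region-in-transition: encoded region name -> node state.
    private final Map<String, ZkState> znodes = new HashMap<>();

    /** Test helper: simulate a node left behind by a dead region server. */
    void leaveStaleNode(String region, ZkState state) {
        znodes.put(region, state);
    }

    /** Master side: overwrite whatever node exists so it is OFFLINE again. */
    void forceNodeOffline(String region) {
        znodes.put(region, ZkState.M_ZK_REGION_OFFLINE);
    }

    /** Region server side: move the node to 'next' only if it is currently in 'expected'. */
    boolean transition(String region, ZkState expected, ZkState next) {
        if (znodes.get(region) != expected) {
            return false;
        }
        znodes.put(region, next);
        return true;
    }

    ZkState stateOf(String region) {
        return znodes.get(region);
    }

    public static void main(String[] args) {
        ForceOfflineSketch zk = new ForceOfflineSketch();
        String region = "871d1c3bdf98a2c93b527cb6cc61327d";

        // The dead server left the node in OPENING.
        zk.leaveStaleNode(region, ZkState.RS_ZK_REGION_OPENING);

        // Forcing OFFLINE first lets the new server's transition go through.
        zk.forceNodeOffline(region);
        boolean opened = zk.transition(region,
            ZkState.M_ZK_REGION_OFFLINE, ZkState.RS_ZK_REGION_OPENING);
        System.out.println("opened=" + opened);
    }
}
```

In this model the new server's transition succeeds because the master resets the node before asking the server to open the region.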

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
