hbase-issues mailing list archives

From "HBase Review Board (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-3147) Regions stuck in transition after rolling restart, perpetual timeout handling but nothing happens
Date Tue, 26 Oct 2010 06:27:20 GMT

    https://issues.apache.org/jira/browse/HBASE-3147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924856#action_12924856

HBase Review Board commented on HBASE-3147:

Message from: stack@duboce.net

This is an automatically generated e-mail. To reply, visit:

(Updated 2010-10-25 23:25:36.390570)

Review request for hbase and stack.


So, a few extra things after digging in w/ Jon.

1. A watch was not being set on .META. moves: in MetaNodeTracker we were not calling super inside nodeDeleted to reset the watch. During a rolling restart only a few servers actually see .META. move, and it was those that were hanging up; the others, when they came up, would see .META. in its new location.
2. We were not assigning out .META. if the master had trouble reaching it before it saw the hosting server expire. In that case we'd reset .META.'s location in the catalog tracker, but we were also relying on the catalog tracker to determine which server was hosting .META. We use a different technique now.
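The first fix boils down to the subclass letting the base tracker re-establish its ZooKeeper watch after the watch fires. Here is a minimal, self-contained sketch of that bug shape; the class and field names are illustrative only, not the actual MetaNodeTracker code (real ZooKeeper watches are one-shot triggers, modeled here by a boolean):

```java
// Minimal model of the watch-reset bug: a watch is consumed when it fires,
// so the tracker must re-set it in nodeDeleted or go blind to later moves.
// All names here are hypothetical, not the real HBase classes.
class NodeTracker {
    protected boolean watchSet = true;

    // The base class re-establishes the watch when the node disappears.
    public void nodeDeleted(String path) {
        watchSet = true; // re-set the watch so the next change is seen
    }
}

class BuggyMetaTracker extends NodeTracker {
    @Override
    public void nodeDeleted(String path) {
        watchSet = false; // watch fired and is now consumed
        // BUG: no super.nodeDeleted(path) -- the watch is never re-set,
        // so a later .META. relocation is never noticed.
    }
}

class FixedMetaTracker extends NodeTracker {
    @Override
    public void nodeDeleted(String path) {
        watchSet = false;        // watch fired and is now consumed
        super.nodeDeleted(path); // FIX: let the base class re-set the watch
    }
}

public class WatchDemo {
    public static void main(String[] args) {
        NodeTracker buggy = new BuggyMetaTracker();
        buggy.nodeDeleted("/hbase/meta");
        System.out.println("buggy watchSet=" + buggy.watchSet); // false: blind

        NodeTracker fixed = new FixedMetaTracker();
        fixed.nodeDeleted("/hbase/meta");
        System.out.println("fixed watchSet=" + fixed.watchSet); // true: still watching
    }
}
```

In the buggy shape, the tracker handles exactly one .META. deletion and then never hears about the node again, which matches the symptom above: only servers that saw .META. actually move got stuck.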


Adds new handling of the timeouts for PENDING_OPEN and PENDING_CLOSE in the master's in-memory regions-in-transition (RIT) map.

Adds some new broken RIT states into TestMasterFailover.

Some of these broken states don't seem possible to me, but as long as we aren't breaking the existing behaviors and tests, I think it's okay if we handle odd cases that can be mocked. Who knows what will happen in the real world.

The reason TestMasterFailover didn't (and doesn't) really test for the issue in HBASE-3147 is that this new broken condition happens when a regionserver dies or goes offline, rather than during a master failover concurrent w/ a regionserver failure.

v4 of the patch adds to Jon's fixes. It adds one shutdown server handler for root and another for meta, so the processing of servers hosting meta/root doesn't get frozen out. I've seen this happen in my testing.
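The point of the dedicated handlers can be illustrated with plain java.util.concurrent: if meta/root shutdown processing shares one single-threaded queue with ordinary server shutdowns, a single blocked shutdown starves it, while a dedicated executor lets it proceed. This is only a sketch of the design choice, not the actual HBase ExecutorService/EventHandler code in the patch:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Sketch: a separate executor keeps meta-server shutdown handling from being
// frozen out behind ordinary server-shutdown work. Names are illustrative.
public class DedicatedExecutorDemo {
    static String runMetaShutdown() throws Exception {
        ExecutorService serverShutdownPool = Executors.newSingleThreadExecutor();
        ExecutorService metaShutdownPool   = Executors.newSingleThreadExecutor();
        CountDownLatch stuck = new CountDownLatch(1);
        try {
            // An ordinary shutdown handler that blocks (say, waiting on a
            // .META. assignment that will never happen) occupies the shared
            // pool's only thread.
            serverShutdownPool.submit(() -> {
                try { stuck.await(); } catch (InterruptedException ignored) {}
            });

            // On a shared pool this task would queue behind the stuck one;
            // on its own pool, the meta-server shutdown still runs promptly.
            Future<String> meta = metaShutdownPool.submit(() -> "meta reassigned");
            return meta.get(2, TimeUnit.SECONDS);
        } finally {
            stuck.countDown();
            serverShutdownPool.shutdown();
            metaShutdownPool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runMetaShutdown()); // completes despite the stuck handler
    }
}
```

With a single shared pool, the `meta.get(2, TimeUnit.SECONDS)` call would time out instead; the dedicated pool is what avoids the freeze-out described above.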

This addresses bug HBASE-3147.

Diffs (updated)

  trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1027351 
  trunk/src/main/java/org/apache/hadoop/hbase/catalog/MetaReader.java 1027351 
  trunk/src/main/java/org/apache/hadoop/hbase/executor/EventHandler.java 1027351 
  trunk/src/main/java/org/apache/hadoop/hbase/executor/ExecutorService.java 1027351 
  trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1027351 
  trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java 1027351 
  trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 1027351 
  trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java 1027351

  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/MetaNodeTracker.java 1027351 
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java 1027351 
  trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java 1027351 

Diff: http://review.cloudera.org/r/1087/diff


TestMasterFailover passes.



> Regions stuck in transition after rolling restart, perpetual timeout handling but nothing happens
> -------------------------------------------------------------------------------------------------
>                 Key: HBASE-3147
>                 URL: https://issues.apache.org/jira/browse/HBASE-3147
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>             Fix For: 0.90.0
>         Attachments: HBASE-3147-v6.patch
> The rolling restart script is great for bringing on the weird stuff. On my little loaded cluster, if I run it, it horks the cluster and it doesn't recover. I notice two issues that need fixing:
> 1. We'll miss noticing that a server was carrying .META. and it never gets assigned -- the shutdown handlers get stuck in perpetual wait on a .META. assign that will never happen.
> 2. Perpetual cycling of this sequence per region not successfully assigned:
> {code}
> 2010-10-23 21:37:57,404 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: usertable,user510588360,1287547556587.7f2d92497d2d03917afd574ea2aca55b. state=PENDING_OPEN, ts=1287869814294
> 2010-10-23 21:37:57,404 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN or OPENING for too long, reassigning region=usertable,user510588360,1287547556587.7f2d92497d2d03917afd574ea2aca55b.
> 2010-10-23 21:37:57,404 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x2bd57d1475046a Attempting to transition node 7f2d92497d2d03917afd574ea2aca55b from RS_ZK_REGION_OPENING to M_ZK_REGION_OFFLINE
> 2010-10-23 21:37:57,404 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x2bd57d1475046a Attempt to transition the unassigned node for 7f2d92497d2d03917afd574ea2aca55b from RS_ZK_REGION_OPENING to M_ZK_REGION_OFFLINE failed, the node existed but was in the state M_ZK_REGION_OFFLINE
> 2010-10-23 21:37:57,404 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region transitioned OPENING to OFFLINE so skipping timeout, region=usertable,user510588360,1287547556587.7f2d92497d2d03917afd574ea2aca55b.
> ...
> {code}
> The timeout period elapses again and then the same sequence repeats.
> This is what I've been working on.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
