hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-3368) Split message can come in before region opened message; results in 'Region has been PENDING_CLOSE for too long' cycle
Date Thu, 03 Feb 2011 07:24:28 GMT

    [ https://issues.apache.org/jira/browse/HBASE-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989993#comment-12989993 ]

stack commented on HBASE-3368:
------------------------------

There is a problem with this 'fix'.  It leaves a region in RIT and it's not cleared because
this happens:

{code}
2011-02-03 06:42:51,614 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED
for region 811f9efb3df65b2173d7ce0c80ac2a99 from server sv2borg184,60020,1296715278941 but
region was in  the state null and not in expected PENDING_OPEN or OPENING states
{code}

The above happens because, on receipt of the split message, we offline the parent, which involves:

{code}
  public void regionOffline(final HRegionInfo regionInfo) {
    synchronized(this.regionsInTransition) {
      if (this.regionsInTransition.remove(regionInfo.getEncodedName()) != null) {
        this.regionsInTransition.notifyAll();
      }
    }
    // remove the region plan as well just in case.
    clearRegionPlan(regionInfo);
    setOffline(regionInfo);
  }
{code}

.. i.e. we remove the region from RIT on receipt of the split message even though it's in OPENING or OPENED state.
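
To make the ordering concrete, here is a minimal sketch of the sequence described above. It is not the real AssignmentManager: the class RitSketch, its State enum, and the methods startOpening and handleOpened are hypothetical names invented for illustration, and the model covers only the in-memory regionsInTransition map (no zk, no region plans).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified model of the master's in-memory RIT bookkeeping. */
final class RitSketch {
  enum State { PENDING_OPEN, OPENING }

  // Encoded region name -> transition state; stands in for regionsInTransition.
  private final Map<String, State> regionsInTransition = new HashMap<>();

  /** Assumed step: the master has asked a region server to open the region. */
  void startOpening(String encodedName) {
    synchronized (regionsInTransition) {
      regionsInTransition.put(encodedName, State.OPENING);
    }
  }

  /** Mirrors the RIT part of regionOffline: drops the entry whatever its state. */
  void regionOffline(String encodedName) {
    synchronized (regionsInTransition) {
      if (regionsInTransition.remove(encodedName) != null) {
        regionsInTransition.notifyAll();
      }
    }
  }

  /** Returns true if OPENED was expected, false for the "state null" case. */
  boolean handleOpened(String encodedName) {
    synchronized (regionsInTransition) {
      State state = regionsInTransition.get(encodedName);
      if (state != State.PENDING_OPEN && state != State.OPENING) {
        return false; // corresponds to the WARN in the log excerpt
      }
      regionsInTransition.remove(encodedName);
      return true;
    }
  }

  public static void main(String[] args) {
    RitSketch master = new RitSketch();
    String parent = "811f9efb3df65b2173d7ce0c80ac2a99";
    master.startOpening(parent);
    master.regionOffline(parent); // split message is processed first
    System.out.println("OPENED accepted: " + master.handleOpened(parent));
  }
}
```

Running main prints "OPENED accepted: false": by the time OPENED is handled the entry is gone, so the state lookup returns null, which matches the warning in the log.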


> Split message can come in before region opened message; results in 'Region has been PENDING_CLOSE
for too long' cycle
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-3368
>                 URL: https://issues.apache.org/jira/browse/HBASE-3368
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.90.0
>
>
> Another good one.  Look at these excerpts from master log:
> {code}
> 2010-12-16 00:49:45,749 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: TestTable,0078922610,1292373363753.490b382bae33642d12cd717b5785698b.: Daughters; TestTable,0078922610,1292460584999.c8b95dfc9a671083bafdaa0341279777., TestTable,0078933586,1292460584999.7cc636c9a7274eec4e784df2efebbca3. from XXX185,60020,1292460570976
> ....
> 2010-12-16 00:49:46,132 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler:
Opened region TestTable,0078922610,1292373363753.490b382bae33642d12cd717b5785698b. on XXX185,60020,1292460570976
> {code}
> ... so the split will have cleared the parent from in-memory data structures and then
the open handler will add it back (though the region is offlined, split).
> Then the balancer runs....... only no one is holding the region that's being balanced.
> Over on XXX185 I see the open and then split at these times:
> {code}
> 2010-12-16 00:49:43,740 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
Opened TestTable,0078922610,1292373363753.490b382bae33642d12cd717b5785698b.
> .....
> 2010-12-16 00:49:45,003 INFO org.apache.hadoop.hbase.regionserver.SplitTransaction: Starting
split of region TestTable,0078922610,1292373363753.490b382bae33642d12cd717b5785698b.
> {code}
> So, the fact that it takes the Master a while to get around to the zk watcher processing
messes us up.
> Root problem is that we're using two different message buses, zk and then heartbeat.
 Intent is to do all over zk and remove heartbeat but looking at what to do for 0.90.0.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
