hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jean-Daniel Cryans (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-2707) Can't recover from a dead ROOT server if any exceptions happens during log splitting
Date Fri, 25 Jun 2010 17:13:55 GMT

    [ https://issues.apache.org/jira/browse/HBASE-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882638#action_12882638
] 

Jean-Daniel Cryans commented on HBASE-2707:
-------------------------------------------

So actually the code of process is looks like:

{code}
LOG.info("Log split complete, meta reassignment and scanning:");
    if (this.isRootServer) {
      LOG.info("ProcessServerShutdown reassigning ROOT region");
      master.getRegionManager().reassignRootRegion();
      isRootServer = false;  // prevent double reassignment... heh.
    }

    for (MetaRegion metaRegion : metaRegions) {
      LOG.info("ProcessServerShutdown setting to unassigned: " + metaRegion.toString());
      master.getRegionManager().setUnassigned(metaRegion.getRegionInfo(), true);
    }
    // one the meta regions are online, "forget" about them.  Since there are explicit
    // checks below to make sure meta/root are online, this is likely to occur.
    metaRegions.clear();

    if (!rootAvailable()) {
      // Return true so that worker does not put this request back on the
      // toDoQueue.
      // rootAvailable() has already put it on the delayedToDoQueue
      return true;
    }

    if (!rootRescanned) {
      // Scan the ROOT region
      Boolean result = new ScanRootRegion(
          new MetaRegion(master.getRegionManager().getRootRegionLocation(),
              HRegionInfo.ROOT_REGIONINFO), this.master).doWithRetries();
      if (result == null) {
        // Master is closing - give up
        return true;
      }

      if (LOG.isDebugEnabled()) {
        LOG.debug("Process server shutdown scanning root region on " +
          master.getRegionManager().getRootRegionLocation().getBindAddress() +
          " finished " + Thread.currentThread().getName());
      }
      rootRescanned = true;
    }
{code}

So if the RS had -ROOT-, it will be reassigned right away and then the method returns if !rootAvailable.
Later when we come back and root was assigned, process server shutdown will finish its job.
This is how the code you pasted succeeds.

> Can't recover from a dead ROOT server if any exceptions happens during log splitting
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-2707
>                 URL: https://issues.apache.org/jira/browse/HBASE-2707
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Jean-Daniel Cryans
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.21.0
>
>         Attachments: HBASE-2707.patch
>
>
> There's an almost easy way to get stuck after a RS holding ROOT dies, usually from a
GC-like event. It happens frequently to my TestReplication in HBASE-2223.
> Some logs:
> {code}
> 2010-06-10 11:35:52,090 INFO  [master] wal.HLog(1175): Spliting is done. Removing old
log dir hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
> 2010-06-10 11:35:52,095 WARN  [master] master.RegionServerOperationQueue(183): Failed
processing: ProcessServerShutdown of 10.10.1.63,55846,1276194933831; putting onto delayed
todo queue
> java.io.IOException: Cannot delete: hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.splitLog(HLog.java:1179)
>         at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:298)
>         at org.apache.hadoop.hbase.master.RegionServerOperationQueue.process(RegionServerOperationQueue.java:149)
>         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:456)
> Caused by: java.io.IOException: java.io.IOException: /user/jdcryans/.logs/10.10.1.63,55846,1276194933831
is non empty
> 2010-06-10 11:35:52,097 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT-
isn't online, can't process delayedToDoQueue items
> 2010-06-10 11:35:53,098 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT-
isn't online, can't process delayedToDoQueue items
> 2010-06-10 11:35:53,523 INFO  [main.serverMonitor] master.ServerManager$ServerMonitor(131):
1 region servers, 1 dead, average load 14.0[10.10.1.63,55846,1276194933831]
> 2010-06-10 11:35:54,099 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT-
isn't online, can't process delayedToDoQueue items
> 2010-06-10 11:35:55,101 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT-
isn't online, can't process delayedToDoQueue items
> {code}
> The last lines are my own debug. Since we don't process the delayed todo if ROOT isn't
online, we'll never reassign the regions. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message