hbase-issues mailing list archives

From "Jerry He (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-13845) Expire of one region server carrying meta can bring down the master
Date Thu, 04 Jun 2015 23:07:38 GMT

    [ https://issues.apache.org/jira/browse/HBASE-13845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573756#comment-14573756 ]

Jerry He commented on HBASE-13845:
----------------------------------

Every retry hits the same "IOException: hbase:meta is onlined on the dead server".
The retry logic is simple.

This is the call sequence:

MetaServerShutdownHandler --> process() --> verifyAndAssignMetaWithRetries() --> verifyAndAssignMeta()

{code}
  private void verifyAndAssignMetaWithRetries() throws IOException {
    int iTimes = this.server.getConfiguration().getInt(
        "hbase.catalog.verification.retries", 10);

    long waitTime = this.server.getConfiguration().getLong(
        "hbase.catalog.verification.timeout", 1000);

    int iFlag = 0;
    while (true) {
      try {
        verifyAndAssignMeta();
        break;
      } catch (KeeperException e) {
        this.server.abort("In server shutdown processing, assigning meta", e);
        throw new IOException("Aborting", e);
      } catch (Exception e) {
        if (iFlag >= iTimes) {
          this.server.abort("verifyAndAssignMeta failed after" + iTimes
              + " times retries, aborting", e);
          throw new IOException("Aborting", e);
        }
{code}
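
As a rough, self-contained illustration of why the retries do not help here, the sketch below models the same pattern outside HBase. It is not the HBase code: MetaRetrySketch, MetaCheck and deadServerRespondsFor are made-up names, the check is simulated, and the 1 ms wait only keeps the demo fast. It also assumes the loop sleeps for waitTime between attempts, as the "10x1000ms" in the description suggests. If the dead server keeps answering for longer than the retry budget, every attempt fails the same way and the caller aborts.

```java
import java.io.IOException;

public class MetaRetrySketch {

    /** Hypothetical stand-in for the check done by verifyAndAssignMeta(). */
    interface MetaCheck {
        void run() throws Exception;
    }

    /**
     * Runs the check, sleeping and retrying after each failure.
     * Returns the number of attempts made, or throws once the
     * retry budget is used up (the point where the master would abort).
     */
    static int runWithRetries(MetaCheck check, int maxRetries, long waitMillis)
            throws IOException, InterruptedException {
        int retries = 0;
        while (true) {
            try {
                check.run();
                return retries + 1;
            } catch (Exception e) {
                if (retries >= maxRetries) {
                    throw new IOException("Aborting after " + maxRetries + " retries", e);
                }
                Thread.sleep(waitMillis);
                retries++;
            }
        }
    }

    /** Simulated check: the dead server still answers for the first 'calls' checks. */
    static MetaCheck deadServerRespondsFor(int calls) {
        int[] seen = {0};
        return () -> {
            if (seen[0]++ < calls) {
                throw new IOException("hbase:meta is onlined on the dead server");
            }
        };
    }

    public static void main(String[] args) throws Exception {
        // Dead server stops answering after 3 checks, so the 4th attempt succeeds.
        System.out.println(runWithRetries(deadServerRespondsFor(3), 10, 1));

        // Dead server outlives the whole retry budget, so we end up aborting.
        try {
            runWithRetries(deadServerRespondsFor(100), 10, 1);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With the defaults quoted above (10 retries, 1000 ms), a region server that takes more than roughly ten seconds to actually go away would exhaust the budget in this model.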

> Expire of one region server carrying meta can bring down the master
> -------------------------------------------------------------------
>
>                 Key: HBASE-13845
>                 URL: https://issues.apache.org/jira/browse/HBASE-13845
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.1.0
>            Reporter: Jerry He
>
> There seems to be a code bug that can cause the expiration of one region server carrying meta to bring down the master in certain cases.
> Here is the sequence of events.
> a) The master detects the expiration of a region server on ZK, and starts to expire the region server.
> b) Since the failed region server carries meta, the shutdown handler calls verifyAndAssignMetaWithRetries() while processing the expired rs.
> c) In verifyAndAssignMeta(), there is logic to verify the meta region location:
> {code}
> if (!server.getMetaTableLocator().verifyMetaRegionLocation(server.getConnection(),
>       this.server.getZooKeeper(), timeout)) {
>       this.services.getAssignmentManager().assignMeta
>       (HRegionInfo.FIRST_META_REGIONINFO);
>     } else if (serverName.equals(server.getMetaTableLocator().getMetaRegionLocation(
>       this.server.getZooKeeper()))) {
>       throw new IOException("hbase:meta is onlined on the dead server "
>           + serverName);
> {code}
> If we see the meta region is still alive on the expired rs, we throw an exception.
> We do some retries (default 10x1000ms) for verifyAndAssignMeta.
> If we still get the exception after retries, we abort the master.
> {code}
> 2015-05-27 06:58:30,156 FATAL [MASTER_META_SERVER_OPERATIONS-bdvs1163:60000-0] master.HMaster: Master server abort: loaded coprocessors are: []
> 2015-05-27 06:58:30,156 FATAL [MASTER_META_SERVER_OPERATIONS-bdvs1163:60000-0] master.HMaster: verifyAndAssignMeta failed after10 times retries, aborting
> java.io.IOException: hbase:meta is onlined on the dead server bdvs1164.svl.ibm.com,16020,1432681743203
>         at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.verifyAndAssignMeta(MetaServerShutdownHandler.java:162)
>         at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.verifyAndAssignMetaWithRetries(MetaServerShutdownHandler.java:184)
>         at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:93)
>         at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> 2015-05-27 06:58:30,156 INFO  [MASTER_META_SERVER_OPERATIONS-bdvs1163:60000-0] regionserver.HRegionServer: STOPPED: verifyAndAssignMeta failed after10 times retries, aborting
> {code}
> The problem happens when the expired region server is slow to process its own expiration or dies slowly, and is still able to respond to the master's meta verification in the meantime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
