hbase-issues mailing list archives

From "Aditya Kishore (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-10272) Cluster becomes in-operational if the node hosting the active Master AND ROOT/META table goes offline
Date Fri, 03 Jan 2014 22:20:51 GMT

    [ https://issues.apache.org/jira/browse/HBASE-10272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861956#comment-13861956 ]

Aditya Kishore commented on HBASE-10272:
----------------------------------------

I couldn't find a way to simulate the entire host going offline at once. All the kill() and
abort() methods close the regions, which cleans up the information in ZK that leads to
this situation.

> Cluster becomes in-operational if the node hosting the active Master AND ROOT/META table
goes offline
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-10272
>                 URL: https://issues.apache.org/jira/browse/HBASE-10272
>             Project: HBase
>          Issue Type: Bug
>          Components: IPC/RPC
>    Affects Versions: 0.94.15
>            Reporter: Aditya Kishore
>            Assignee: Aditya Kishore
>            Priority: Critical
>         Attachments: HBASE-10272_0.94.patch
>
>
> Since HBASE-6364, the HBase client caches a connection failure to a server, and any subsequent
attempt to connect to that server throws a {{FailedServerException}}.
> Now if a node that hosted the active Master AND the ROOT/META table goes offline, the newly
anointed Master's initial attempt to connect to the dead region server fails with {{NoRouteToHostException}},
which it handles, but on the second attempt it crashes with {{FailedServerException}}.
> Here is the log from one such occurrence:
> {noformat}
> 2013-11-20 10:58:00,161 FATAL org.apache.hadoop.hbase.master.HMaster: Master server abort: loaded coprocessors are: []
> 2013-11-20 10:58:00,161 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
> org.apache.hadoop.hbase.ipc.HBaseClient$FailedServerException: This server is in the failed servers list: xxx02/192.168.1.102:60020
>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:425)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1124)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:974)
>         at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:86)
>         at $Proxy9.getProtocolVersion(Unknown Source)
>         at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getProxy(WritableRpcEngine.java:138)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:208)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1335)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1294)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1281)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.getCachedConnection(CatalogTracker.java:506)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:383)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:445)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnection(CatalogTracker.java:464)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyMetaRegionLocation(CatalogTracker.java:624)
>         at org.apache.hadoop.hbase.master.HMaster.assignRootAndMeta(HMaster.java:684)
>         at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:560)
>         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:376)
>         at java.lang.Thread.run(Thread.java:662)
> 2013-11-20 10:58:00,162 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
> 2013-11-20 10:58:00,162 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60000
> {noformat}
> Each of the backup masters will crash with the same error, and restarting them will have the
same effect. Once this happens, the cluster will remain inoperable until the node with the
region server is brought back online (or the ZooKeeper node containing the root region server and/or
the META entry from the ROOT table is deleted).
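
The two-step failure in the description above can be illustrated with a small, self-contained sketch. It is not the HBase client code: the class names, the `connect` and `attempt` helpers, and the 2-second window are all assumptions made for the illustration. It only models the behavior the report describes, where the first attempt fails at the network level and is remembered, and the next attempt is rejected from the cache without touching the network.

```java
import java.io.IOException;
import java.net.NoRouteToHostException;
import java.util.HashMap;
import java.util.Map;

// Illustrative model only; every name here is hypothetical, not the HBase API.
class FailedServerCacheSketch {

    /** Stand-in for the exception named in the report. */
    static class FailedServerException extends IOException {
        FailedServerException(String message) {
            super(message);
        }
    }

    /** Remembers recently failed addresses for a fixed time window. */
    static class FailedServers {
        private final Map<String, Long> expiryByAddress = new HashMap<>();
        private final long windowMillis;

        FailedServers(long windowMillis) {
            this.windowMillis = windowMillis;
        }

        synchronized void add(String address, long nowMillis) {
            expiryByAddress.put(address, nowMillis + windowMillis);
        }

        synchronized boolean isFailed(String address, long nowMillis) {
            Long expiry = expiryByAddress.get(address);
            if (expiry == null) {
                return false;
            }
            if (nowMillis >= expiry) {
                expiryByAddress.remove(address); // entry has aged out
                return false;
            }
            return true;
        }
    }

    // Assumed window of 2 seconds, chosen only for the demonstration.
    private final FailedServers failedServers = new FailedServers(2000L);

    /** Simulated connect to a host that is permanently unreachable. */
    void connect(String address, long nowMillis) throws IOException {
        if (failedServers.isFailed(address, nowMillis)) {
            // Rejected from the cache; no network attempt is made.
            throw new FailedServerException(
                "This server is in the failed servers list: " + address);
        }
        // The network attempt fails, and the failure is cached.
        failedServers.add(address, nowMillis);
        throw new NoRouteToHostException("No route to host: " + address);
    }

    /** Returns the simple name of the exception thrown, or "connected". */
    static String attempt(FailedServerCacheSketch client, String address, long nowMillis) {
        try {
            client.connect(address, nowMillis);
            return "connected";
        } catch (IOException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        FailedServerCacheSketch client = new FailedServerCacheSketch();
        String deadServer = "xxx02/192.168.1.102:60020";
        System.out.println(attempt(client, deadServer, 0L));    // NoRouteToHostException
        System.out.println(attempt(client, deadServer, 100L));  // FailedServerException
        System.out.println(attempt(client, deadServer, 5000L)); // NoRouteToHostException
    }
}
```

In this model, a caller that handles only the first exception type and retries within the window sees a different exception on the retry, which matches the sequence reported for the new Master.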



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
