hbase-issues mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-10272) Cluster becomes nonoperational if the node hosting the active Master AND ROOT/META table goes offline
Date Sat, 04 Jan 2014 05:54:59 GMT

    [ https://issues.apache.org/jira/browse/HBASE-10272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862223#comment-13862223 ]

Hudson commented on HBASE-10272:
--------------------------------

SUCCESS: Integrated in HBase-0.98-on-Hadoop-1.1 #52 (See [https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/52/])
HBASE-10272 Cluster becomes nonoperational if the node hosting the active Master AND ROOT/META
table goes offline (Tedyu: rev 1555313)
* /hbase/branches/0.98/hbase-client/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java


> Cluster becomes nonoperational if the node hosting the active Master AND ROOT/META table goes offline
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-10272
>                 URL: https://issues.apache.org/jira/browse/HBASE-10272
>             Project: HBase
>          Issue Type: Bug
>          Components: IPC/RPC
>    Affects Versions: 0.96.1, 0.94.15
>            Reporter: Aditya Kishore
>            Assignee: Aditya Kishore
>            Priority: Critical
>             Fix For: 0.98.0, 0.99.0
>
>         Attachments: HBASE-10272.patch, HBASE-10272_0.94.patch
>
>
> Since HBASE-6364, the HBase client caches a connection failure to a server, and any subsequent attempt to connect to that server throws a {{FailedServerException}}.
> If the node that hosted both the active Master and the ROOT/META table goes offline, the newly anointed Master's first attempt to connect to the dead region server fails with a {{NoRouteToHostException}}, which it handles. The second attempt, however, fails with a {{FailedServerException}}, which it does not handle, so the Master crashes.
> Here is the log from one such occurrence:
> {noformat}
> 2013-11-20 10:58:00,161 FATAL org.apache.hadoop.hbase.master.HMaster: Master server abort: loaded coprocessors are: []
> 2013-11-20 10:58:00,161 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
> org.apache.hadoop.hbase.ipc.HBaseClient$FailedServerException: This server is in the failed servers list: xxx02/192.168.1.102:60020
>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:425)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1124)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:974)
>         at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:86)
>         at $Proxy9.getProtocolVersion(Unknown Source)
>         at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getProxy(WritableRpcEngine.java:138)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:208)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1335)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1294)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1281)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.getCachedConnection(CatalogTracker.java:506)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:383)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:445)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnection(CatalogTracker.java:464)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyMetaRegionLocation(CatalogTracker.java:624)
>         at org.apache.hadoop.hbase.master.HMaster.assignRootAndMeta(HMaster.java:684)
>         at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:560)
>         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:376)
>         at java.lang.Thread.run(Thread.java:662)
> 2013-11-20 10:58:00,162 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
> 2013-11-20 10:58:00,162 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60000
> {noformat}
> Each backup master will crash with the same error, and restarting them has the same effect. Once this happens, the cluster remains non-operational until the node hosting the region server is brought back online (or the ZooKeeper node containing the root region server and/or the META entry in the ROOT table is deleted).
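
The failure sequence described above can be illustrated with a small, self-contained Java sketch. It is not the committed patch (which touches CatalogTracker.java) and it does not use any HBase classes: FailedServerSketch, CachingClient, isReachable, the handleFailedServer flag and the 2000 ms expiry are all hypothetical names and values chosen for illustration, and the nested FailedServerException is only a stand-in for HBaseClient$FailedServerException. The sketch assumes the client fast-fails any connection to an address it recently failed to reach, and shows why a caller that handles only NoRouteToHostException survives the first attempt but not the second.

```java
import java.io.IOException;
import java.net.NoRouteToHostException;
import java.util.HashMap;
import java.util.Map;

public class FailedServerSketch {

    /** Hypothetical stand-in for HBaseClient$FailedServerException. */
    static class FailedServerException extends IOException {
        FailedServerException(String message) {
            super(message);
        }
    }

    /** Hypothetical client that remembers addresses it recently failed to reach. */
    static class CachingClient {
        private final Map<String, Long> failedUntil = new HashMap<>();
        private final long expiryMillis; // assumed cache lifetime, not an HBase default

        CachingClient(long expiryMillis) {
            this.expiryMillis = expiryMillis;
        }

        /** Simulates connecting to a host that is down; always throws. */
        void connect(String address, long nowMillis) throws IOException {
            Long until = failedUntil.get(address);
            if (until != null && nowMillis < until) {
                // Fast-fail path: the address is cached, so no network attempt is made.
                throw new FailedServerException(
                        "This server is in the failed servers list: " + address);
            }
            // The real attempt fails; remember the address before reporting the error.
            failedUntil.put(address, nowMillis + expiryMillis);
            throw new NoRouteToHostException(address);
        }
    }

    /**
     * Returns false when the server is known to be unreachable.
     * With handleFailedServer == false, the cached failure propagates to the
     * caller, which mirrors the crash in the log above.
     */
    static boolean isReachable(CachingClient client, String address, long nowMillis,
                               boolean handleFailedServer) throws IOException {
        try {
            client.connect(address, nowMillis);
            return true;
        } catch (NoRouteToHostException e) {
            return false; // first attempt: handled
        } catch (FailedServerException e) {
            if (handleFailedServer) {
                return false; // treat a cached failure like any other unreachable server
            }
            throw e; // second attempt: unhandled
        }
    }

    public static void main(String[] args) throws IOException {
        CachingClient client = new CachingClient(2000);
        String address = "xxx02/192.168.1.102:60020";

        System.out.println("first attempt reachable: "
                + isReachable(client, address, 0, false));
        try {
            isReachable(client, address, 100, false);
        } catch (FailedServerException e) {
            System.out.println("second attempt aborted: " + e.getMessage());
        }
        System.out.println("with handling, reachable: "
                + isReachable(client, address, 200, true));
    }
}
```

In this sketch, treating the cached failure the same way as the original connection error lets the caller carry on (for example, by reassigning the region) instead of aborting. Whether that matches the exact change made for this issue should be checked against the attached patches.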



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
