hbase-dev mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Can master detect sudden region server death?
Date Fri, 13 Aug 2010 23:06:44 GMT
The master gets a Watcher event from ZooKeeper when the region
server's session expires and its ephemeral znode is deleted. By
default the session timeout is quite high, something like 1 minute
(see hbase-default.xml), to cope with users who have long GC pauses.
So a killed region server is not noticed until that timeout elapses.
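
If you want a test to notice the death sooner, one option is to lower
the session timeout in hbase-site.xml. A minimal sketch, assuming the
property is zookeeper.session.timeout (check the name and default in
your hbase-default.xml); the 10 second value is only an illustration,
not a recommendation:

```xml
<!-- hbase-site.xml: illustrative override, value chosen arbitrarily -->
<property>
  <name>zookeeper.session.timeout</name>
  <!-- milliseconds; assumed example value for fault tests only -->
  <value>10000</value>
</property>
```

Keep in mind that the ZooKeeper server negotiates the session timeout
against its own tick time, so the effective value may differ from the
one requested, and a low value makes false expirations during GC
pauses more likely.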

For a good fault-testing framework, please use
http://github.com/toddlipcon/gremlins. It was written by Todd Lipcon
to test HBase's handling of region server death.

J-D

On Fri, Aug 13, 2010 at 3:59 PM, James Kennedy <james.kennedy@troove.net> wrote:
> For our system it is critical that there be no data loss and fast
> recovery time if any node goes down.
>
> We've recently updated the hbase-transactional-tableindexed extension
> to work with the latest 0.89.20100726 version of HBase (still to be
> pushed).
> All HBase tests are passing, but when we started to write our own and
> test true sudden HRegionServer death we ran into trouble.
> It seems that the HMaster does not recognize the kill even after many
> minutes. Client requests are blocked and the log keeps repeating the
> entries below.
>
> We realized that HBase's own tests that require RegionServer death use
> abort() and not kill(). abort() does enough cleanup that it does not
> adequately simulate a sudden (e.g. JVM crash) death.
>
> As an experiment I made HRegionServer.kill() public and modified
> MiniHBaseCluster to call that from abort() instead. Now a test like
> TestMasterTransitions will exhibit similar behaviour: the HMaster
> never notices the RegionServer is gone.
>
> Could it really be that sudden region server death is not handled in
> hbase?
> Or more likely is this a failure of the testing framework to
> adequately simulate kill -9?
>
> James Kennedy
> Project Manager
> Troove Inc.
>
>
> -------------------------------
>
> [13/08/10 15:12:12] 259494 [n.serverMonitor] INFO  oop.hbase.master.ServerManager  - 2 region servers, 0 dead, average load 3.5
> [13/08/10 15:12:12] 259560 [ger.metaScanner] INFO  adoop.hbase.master.BaseScanner  - RegionManager.metaScanner scanning meta region {server: 10.0.1.4:56908, regionname: .META.,,1.1028785192, startKey: <>}
> [13/08/10 15:12:12] 259561 [ger.metaScanner] WARN  adoop.hbase.master.BaseScanner  - Scan one META region: {server: 10.0.1.4:56908, regionname: .META.,,1.1028785192, startKey: <>}
> java.net.ConnectException: Connection refused
>        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:309)
>        at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:857)
>        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:725)
>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:253)
>        at $Proxy10.openScanner(Unknown Source)
>        at org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:182)
>        at org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>        at org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>        at org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:156)
>        at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
>
>
>
>
