hbase-dev mailing list archives

From James Kennedy <james.kenn...@troove.net>
Subject Can master detect sudden region server death?
Date Fri, 13 Aug 2010 22:59:35 GMT
For our system it is critical that there be no data loss and fast recovery time if any node
goes down.

We've recently updated the hbase-transactional-tableindexed extension to work with the latest
0.89.20100726 version of HBase (still to be pushed).
All HBase tests pass, but when we started writing our own tests for true sudden
HRegionServer death we ran into trouble.
It seems that the HMaster does not recognize the kill even after many minutes.  Client requests
are blocked and the master log keeps repeating the entries below.

We realized that HBase's own tests that require RegionServer death use abort() rather than kill().
abort() does enough cleanup that it does not adequately simulate a sudden (e.g. JVM crash) death.

As an experiment I made HRegionServer.kill() public and modified HBaseMiniCluster to call
it from abort() instead.  Now a test like TestMasterTransitions exhibits similar behaviour:
the HMaster never notices that the RegionServer is gone.
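
To make the abort() vs. kill() distinction concrete, here is a toy model of the two paths. It is
not HBase code: ToyMaster, ToyRegionServer, reportExit() and the lease-based expiry are all
hypothetical names and assumptions made up for illustration. The only point is that a clean abort
tells the master right away, while a sudden kill can only be noticed if something expires the
silent server.

```java
import java.util.HashMap;
import java.util.Map;

public class SuddenDeathSketch {

    /** Hypothetical master: remembers the last heartbeat per server. */
    static class ToyMaster {
        private final long leaseMillis;
        private final Map<String, Long> lastHeartbeat = new HashMap<>();

        ToyMaster(long leaseMillis) {
            this.leaseMillis = leaseMillis;
        }

        void heartbeat(String server, long now) {
            lastHeartbeat.put(server, now);
        }

        /** Clean shutdown path: the server announces that it is leaving. */
        void reportExit(String server) {
            lastHeartbeat.remove(server);
        }

        /** Drops servers whose lease has lapsed; returns how many were declared dead. */
        int expireStale(long now) {
            int before = lastHeartbeat.size();
            lastHeartbeat.values().removeIf(t -> now - t > leaseMillis);
            return before - lastHeartbeat.size();
        }

        int liveServers() {
            return lastHeartbeat.size();
        }
    }

    /** Hypothetical region server with a graceful and a sudden way to die. */
    static class ToyRegionServer {
        private final String name;
        private final ToyMaster master;
        private boolean running = true;

        ToyRegionServer(String name, ToyMaster master) {
            this.name = name;
            this.master = master;
        }

        void tick(long now) {
            if (running) {
                master.heartbeat(name, now);
            }
        }

        /** Graceful: stops and notifies the master. */
        void abort() {
            running = false;
            master.reportExit(name);
        }

        /** Sudden: stops without telling anyone (stand-in for kill -9). */
        void kill() {
            running = false;
        }
    }

    public static void main(String[] args) {
        ToyMaster master = new ToyMaster(1000);
        ToyRegionServer rs1 = new ToyRegionServer("rs1", master);
        ToyRegionServer rs2 = new ToyRegionServer("rs2", master);
        rs1.tick(0);
        rs2.tick(0);

        rs1.abort(); // master learns about this immediately
        rs2.kill();  // master still believes rs2 is alive
        System.out.println("live right after kill: " + master.liveServers());

        int expired = master.expireStale(1500); // lease lapsed for rs2
        System.out.println("expired by lease check: " + expired);
        System.out.println("live after lease check: " + master.liveServers());
    }
}
```

In this toy model a test that only ever calls abort() never exercises the expireStale() path,
which is the analogue of the gap we seem to be hitting in the real tests.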

Could it really be that sudden region server death is not handled in HBase?
Or, more likely, is this a failure of the testing framework to adequately simulate kill -9?

James Kennedy
Project Manager
Troove Inc.


-------------------------------

[13/08/10 15:12:12] 259494 [n.serverMonitor] INFO  oop.hbase.master.ServerManager  - 2 region servers, 0 dead, average load 3.5
[13/08/10 15:12:12] 259560 [ger.metaScanner] INFO  adoop.hbase.master.BaseScanner  - RegionManager.metaScanner scanning meta region {server: 10.0.1.4:56908, regionname: .META.,,1.1028785192, startKey: <>}
[13/08/10 15:12:12] 259561 [ger.metaScanner] WARN  adoop.hbase.master.BaseScanner  - Scan one META region: {server: 10.0.1.4:56908, regionname: .META.,,1.1028785192, startKey: <>}
java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
	at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:309)
	at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:857)
	at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:725)
	at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:253)
	at $Proxy10.openScanner(Unknown Source)
	at org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:182)
	at org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
	at org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
	at org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:156)
	at org.apache.hadoop.hbase.Chore.run(Chore.java:68)



