hbase-dev mailing list archives

From James Kennedy <james.kenn...@troove.net>
Subject Re: Can master detect sudden region server death?
Date Tue, 17 Aug 2010 18:41:19 GMT
Hmm, yeah, I waited well over the ZK session timeout for the HMaster to come out of the exception
loop, but it never did.

Thanks, I'll follow up with a look at Gremlins. How is it used within HBase? Do you continuously
run Gremlins on a cluster in an automated way?

James Kennedy
Project Manager
Troove Inc.

1 877 330 8501

On 2010-08-13, at 4:06 PM, Jean-Daniel Cryans wrote:

> The master will get a Watcher event from ZooKeeper when the region
> server's session expires and its ephemeral znode is deleted. By
> default the session timeout is quite high, something like 1 minute
> (see hbase-default.xml), to cope with users who have long GC
> pauses.
> 
> For a good fault testing framework, please use
> http://github.com/toddlipcon/gremlins. This was written by Todd Lipcon
> to test HBase's handling of region servers' death.
> 
> J-D
> 
> On Fri, Aug 13, 2010 at 3:59 PM, James Kennedy <james.kennedy@troove.net> wrote:
>> For our system it is critical that there be no data loss and fast recovery time if any node goes down.
>> 
>> We've recently updated the hbase-transactional-tableindexed extension to work with the latest 0.89.20100726 version of HBase (still to be pushed).
>> All HBase tests are passing, but when we started to write our own tests for true sudden HRegionServer death we ran into trouble.
>> It seems that the HMaster does not recognize the kill even after many minutes. Client requests are blocked and the master log keeps repeating the entries below.
>> 
>> We realized that HBase's own tests that require RegionServer death use abort() and not kill(). abort() does enough cleanup that it does not adequately simulate a sudden death (e.g. a JVM crash).
>> 
>> As an experiment I made HRegionServer.kill() public and modified HBaseMiniCluster to call that from abort() instead. Now a test like TestMasterTransitions will exhibit similar behaviour: the HMaster never notices the RegionServer is gone.
>> 
>> Could it really be that sudden region server death is not handled in HBase?
>> Or, more likely, is this a failure of the testing framework to adequately simulate kill -9?
>> 
>> James Kennedy
>> Project Manager
>> Troove Inc.
>> 
>> 
>> -------------------------------
>> 
>> [13/08/10 15:12:12] 259494 [n.serverMonitor] INFO  oop.hbase.master.ServerManager - 2 region servers, 0 dead, average load 3.5
>> [13/08/10 15:12:12] 259560 [ger.metaScanner] INFO  adoop.hbase.master.BaseScanner - RegionManager.metaScanner scanning meta region {server: 10.0.1.4:56908, regionname: .META.,,1.1028785192, startKey: <>}
>> [13/08/10 15:12:12] 259561 [ger.metaScanner] WARN  adoop.hbase.master.BaseScanner - Scan one META region: {server: 10.0.1.4:56908, regionname: .META.,,1.1028785192, startKey: <>}
>> java.net.ConnectException: Connection refused
>>        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>>        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>>        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
>>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:309)
>>        at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:857)
>>        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:725)
>>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:253)
>>        at $Proxy10.openScanner(Unknown Source)
>>        at org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:182)
>>        at org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>>        at org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>>        at org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:156)
>>        at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
>> 
>> 
>> 
>> 
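
For reference, the detection path J-D describes (session expires, ephemeral znode goes away, watcher fires) can be modelled with a small toy that uses no ZooKeeper code at all. This is only a sketch under stated assumptions: SessionTracker and its methods are hypothetical names invented for illustration, the clock is passed in by hand, and it assumes that an orderly stop closes the session explicitly while a hard kill just stops heartbeating. It is not how HBase or ZooKeeper implement this.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Toy model of session-based liveness tracking. Hypothetical; NOT the ZooKeeper API. */
public class SessionTracker {
    private final long timeoutMs;
    // server name -> logical time of its last heartbeat
    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    // callbacks standing in for the master's watcher
    private final List<Consumer<String>> watchers = new ArrayList<>();

    public SessionTracker(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    public void watch(Consumer<String> watcher) {
        watchers.add(watcher);
    }

    public void heartbeat(String server, long nowMs) {
        lastHeartbeat.put(server, nowMs);
    }

    /** Orderly stop: the session is closed explicitly, so watchers fire at once. */
    public void close(String server) {
        if (lastHeartbeat.remove(server) != null) {
            notifyWatchers(server);
        }
    }

    /** Expire every session whose last heartbeat is older than the timeout. */
    public void tick(long nowMs) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastHeartbeat.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMs - entry.getValue() > timeoutMs) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        for (String server : expired) {
            notifyWatchers(server);
        }
    }

    private void notifyWatchers(String server) {
        for (Consumer<String> watcher : watchers) {
            watcher.accept(server);
        }
    }

    public static void main(String[] args) {
        SessionTracker tracker = new SessionTracker(60_000); // roughly 1 minute
        List<String> dead = new ArrayList<>();
        tracker.watch(dead::add);

        tracker.heartbeat("rs1", 0);
        tracker.heartbeat("rs2", 0);

        tracker.close("rs1");      // orderly stop: noticed immediately
        tracker.tick(30_000);      // rs2 was hard-killed at t=0; still inside the timeout
        System.out.println(dead);  // prints [rs1]

        tracker.tick(60_001);      // timeout has elapsed; rs2 is now reported dead
        System.out.println(dead);  // prints [rs1, rs2]
    }
}
```

In this model a hard-killed server is always reported once the timeout elapses, which is why a master that stays in the exception loop long after that point looks like something other than plain detection latency.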

