hbase-user mailing list archives

From Juhani Connolly <juh...@ninja.co.jp>
Subject Slow recovery on lost data node?
Date Wed, 08 Dec 2010 08:56:10 GMT
Hi there,

We're currently running a cluster under expected load and testing 
various hardware failure cases. Among them is a lost 
regionServer/dataNode, which results in our writer process (in our case a 
servlet under Tomcat) just waiting indefinitely on put flushes until the 
region becomes available again (in the process, threads stack up until the 
server limit is reached). I've included logs of the relevant time period from 
one of my regionservers at http://pastie.org/1358217 .

During the 15 minutes from around 16:12 to 16:27, all writes failed. 
Incidentally, during this time I am still able to read data fine with 
another process which is only reading from hbase.

Is this 15-minute period of not being able to write working as 
intended, or is something wrong with the way I'm trying to access hbase? 
The main access code I'm using can be seen at http://pastie.org/1358224 
. tPool is an initialised HTablePool, and the general idea is to store 
puts without flushing until they have been held onto for a while (to 
batch the flushes a little bit).
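
For clarity, the pattern is roughly the sketch below (a simplified version 
with placeholder table/family names and a made-up flush interval; the real 
code is in the pastie above, and here I use HTable directly rather than the 
pool):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedWriter {
    private final HTable table;
    private long lastFlush = System.currentTimeMillis();
    private static final long FLUSH_INTERVAL_MS = 5000L; // placeholder interval

    public BufferedWriter() throws Exception {
        Configuration conf = HBaseConfiguration.create();
        table = new HTable(conf, "events");    // placeholder table name
        table.setAutoFlush(false);             // let the client buffer puts
    }

    public synchronized void write(byte[] row, byte[] value) throws Exception {
        Put p = new Put(row);
        p.add(Bytes.toBytes("d"), Bytes.toBytes("v"), value); // placeholder family/qualifier
        table.put(p);                          // buffered client-side, not sent yet
        if (System.currentTimeMillis() - lastFlush > FLUSH_INTERVAL_MS) {
            table.flushCommits();              // the flush is where the threads end up blocking
            lastFlush = System.currentTimeMillis();
        }
    }
}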

If it is working as intended, what would be the correct steps to reduce 
it (perhaps reducing the configured region size)?

Is there anything I can do to simply make the writes fail when the region 
isn't available for writing? As it is, threads keep getting created until 
the container maximum is reached, each waiting for something (presumably the 
region to become available again?). I expected that 
hbase.client.retries.number would be the appropriate setting, but based on 
the lack of any logs for failed writes, the current writes simply aren't aborting.
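
What I had in mind was setting it on the client configuration, along these 
lines (the values here are only examples):

Configuration conf = HBaseConfiguration.create();
// idea: give up after a few retries instead of blocking servlet threads indefinitely
conf.setInt("hbase.client.retries.number", 3);   // example value
conf.setLong("hbase.client.pause", 1000L);       // example value, ms between retries
HTable table = new HTable(conf, "events");       // placeholder table name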

Everything is running off the latest CDH3 (hbase-0.89.20100924+28, 
hadoop-0.20.2+737-core) and works well under normal conditions.

Any advice/information would be appreciated.
Thanks,
  Juhani



