hbase-user mailing list archives

From Ameya Kantikar <am...@groupon.com>
Subject Region servers going down under heavy write load
Date Wed, 05 Jun 2013 20:47:22 GMT
Hi,

We have heavy map reduce write jobs running against our cluster. Every once
in a while, we see a region server going down.

We are on: 0.94.2-cdh4.2.0, r

We have done some tuning for heavy map reduce jobs: we have increased scanner
timeouts and lease timeouts, and have also tuned the memstore as follows:

hbase.hregion.memstore.block.multiplier: 4
hbase.hregion.memstore.flush.size: 134217728
hbase.hstore.blockingStoreFiles: 100
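
For reference, these are set in hbase-site.xml, roughly like this (a sketch of
the relevant entries, with the values listed above):

<property>
  <name>hbase.hregion.memstore.block.multiplier</name>
  <value>4</value>
</property>
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>134217728</value> <!-- 128 MB -->
</property>
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>100</value>
</property>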

Even so, we are still facing issues. Looking at the logs, it appears to be due
to a ZooKeeper session timeout. We have tuned the ZooKeeper settings as follows
in hbase-site.xml:

zookeeper.session.timeout: 300000
hbase.zookeeper.property.tickTime: 6000
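
In hbase-site.xml form these are, roughly:

<property>
  <name>zookeeper.session.timeout</name>
  <value>300000</value> <!-- 5 minutes -->
</property>
<property>
  <name>hbase.zookeeper.property.tickTime</name>
  <value>6000</value>
</property>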


The actual log looks like:


2013-06-05 11:46:40,405 WARN org.apache.hadoop.ipc.HBaseServer:
(responseTooSlow):
{"processingtimems":13468,"call":"next(6723331143689528698, 1000), rpc
version=1, client version=29, methodsFingerPrint=54742778","client":"10.20.73.65:41721",
"starttimems":1370432786933,"queuetimems":1,"class":"HRegionServer",
"responsesize":39611416,"method":"next"}

2013-06-05 11:46:54,988 INFO org.apache.hadoop.io.compress.CodecPool: Got
brand-new decompressor [.snappy]

2013-06-05 11:48:03,017 WARN org.apache.hadoop.hdfs.DFSClient:
DFSOutputStream ResponseProcessor exception  for block
BP-53741567-10.20.73.56-1351630463427:blk_9026156240355850298_8775246
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:95)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:656)

2013-06-05 11:48:03,020 WARN org.apache.hadoop.hbase.util.Sleeper: *We
slept 48686ms instead of 3000ms*, this is likely due to a long garbage
collecting pause and it's usually bad, see
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired

2013-06-05 11:48:03,094 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
smartdeals-hbase14-snc1.snc1,60020,1370373396890: Unhandled exception:
org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
currently processing smartdeals-hbase14-snc1.snc1,60020,1370373396890 as
dead server

(Not sure why it says 3000ms when we have the timeout set to 300000ms.)

We have done some GC tuning as well. What can I tune to keep the region
servers from going down? Any ideas?
This is a batch-heavy cluster, and we care less about read latency. We can
increase RAM a bit more, but not by much (the RS already has 20GB of memory).

Thanks in advance.

Ameya
