hbase-user mailing list archives

From Ivan Tretyakov <itretya...@griddynamics.com>
Subject Hbase region servers shuts down unexpectedly
Date Fri, 08 Nov 2013 18:18:50 GMT

We have the following issue on our cluster running HBase 0.92.1-cdh4.1.1.
When we start a full scan of the table, some of the servers shut down
unexpectedly with the following lines in the log:

2013-11-07 21:19:12,173 WARN org.apache.hadoop.ipc.HBaseServer:
{"processingtimems":6723,"call":"next(-3171672497308828151, 1000), rpc
version=1, client version=29, methodsFingerPrint=1891768260","client":"
2013-11-07 21:19:33,009 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
20545ms instead of 3000ms, this is likely due to a long garbage collecting
pause and it's usually bad, see
2013-11-07 21:19:41,651 INFO org.apache.hadoop.hbase.util.VersionInfo:
HBase 0.92.1-cdh4.1.1

Or one more example:

2013-11-07 22:07:02,587 WARN org.apache.hadoop.ipc.HBaseServer:
{"processingtimems":12540,"call":"next(8031108008798991209, 1000), rpc
version=1, client version=29, methodsFingerPrint=1891768260","client":"
2013-11-07 22:08:00,413 WARN org.apache.hadoop.hdfs.DFSClient:
DFSOutputStream ResponseProcessor exception for block
java.io.EOFException: Premature EOF: no length prefix available
2013-11-07 22:08:09,394 INFO org.apache.hadoop.hbase.util.VersionInfo:
HBase 0.92.1-cdh4.1.1

The last line, 'HBase 0.92.1-cdh4.1.1', indicates that a new region server
instance has just started. Each time, I see a 'responseTooLarge' message
before that. The job runs with the '-caching' option set to 1000.

My current assumption is that the problem is caused by a memory shortage on
the RS and a long GC pause, which causes the ZK session to expire and the
server to shut down (-Xmx for the RS is 8GB). Cloudera Manager then restarts
it.
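
As a rough way to check this assumption against the logs, the sketch below
pulls the two numbers out of a Sleeper warning like the one quoted above and
compares the extra time slept with a session timeout. It is only a sketch:
the class and method names are made up for illustration, and the timeout is
not taken from this cluster. It has to be supplied from the cluster's own
zookeeper.session.timeout setting.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HBase.
class SleeperPauseCheck {
    // Matches the text of the Sleeper warning quoted in the log excerpt.
    private static final Pattern SLEPT =
        Pattern.compile("We slept (\\d+)ms instead of (\\d+)ms");

    /** Extra time slept in ms, or -1 if the line is not a Sleeper warning. */
    static long overshootMs(String logLine) {
        Matcher m = SLEPT.matcher(logLine);
        if (!m.find()) {
            return -1;
        }
        return Long.parseLong(m.group(1)) - Long.parseLong(m.group(2));
    }

    /** True if the overshoot alone is at least the given session timeout. */
    static boolean exceedsTimeout(String logLine, long sessionTimeoutMs) {
        return overshootMs(logLine) >= sessionTimeoutMs;
    }

    public static void main(String[] args) {
        // The warning from the first log excerpt, joined onto one line.
        String line = "2013-11-07 21:19:33,009 WARN "
            + "org.apache.hadoop.hbase.util.Sleeper: "
            + "We slept 20545ms instead of 3000ms";
        System.out.println("overshoot ms: " + overshootMs(line));
        if (args.length > 0) {
            // Session timeout in ms, passed in by whoever runs the check.
            long timeoutMs = Long.parseLong(args[0]);
            System.out.println("exceeds timeout: "
                + exceedsTimeout(line, timeoutMs));
        }
    }
}
```

A single warning only describes one sleep cycle, so a result of "false" for
one line would not by itself rule out session expiry.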

I've tried running the job with '-caching' set to 1. No servers were
restarted, but the job didn't finish within a reasonable amount of time. I
understand that decreasing the caching value can mitigate the problem, but
it doesn't look like the right way to me, because the number of regions per
server may increase in the future and we would have a similar problem
again. It will also slow down the job.
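
The trade-off can be made concrete with some back-of-the-envelope
arithmetic. The sketch below multiplies the caching value by an average row
size and by the number of scanners hitting one region server at the same
time. The row size and scanner count in main are purely hypothetical
placeholders, not measurements from this cluster, and the estimate ignores
any per-object overhead on the heap.

```java
// Hypothetical estimate, not an HBase API.
class ScanBufferEstimate {
    /**
     * Rough number of payload bytes held for in-flight next() responses
     * on one region server: rows per call * bytes per row * scanners.
     */
    static long estimateBytes(int caching, long avgRowBytes,
                              int concurrentScanners) {
        long perCall = Math.multiplyExact((long) caching, avgRowBytes);
        return Math.multiplyExact(perCall, (long) concurrentScanners);
    }

    public static void main(String[] args) {
        long avgRowBytes = 100L * 1024L; // assumed average row size
        int scanners = 20;               // assumed concurrent scanners
        for (int caching : new int[] {1, 1000}) {
            System.out.println("caching=" + caching + " -> "
                + estimateBytes(caching, avgRowBytes, scanners) + " bytes");
        }
    }
}
```

With real numbers for row size and scanner count plugged in, the result can
be compared with the 8GB heap to see how much headroom a given caching
value leaves as the number of regions per server grows.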

Do you think the problem is caused by the reasons I assume?
Is this a known issue?
What do you think could be ways to resolve it?
Is there an option to send the response once it becomes too large,
independent of the caching value?
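
To illustrate the behaviour the last question asks about, here is a minimal
sketch of a batcher that closes a batch when either a row limit or a byte
budget is reached, whichever comes first. This is plain Java, not HBase
code, and it does not claim that such an option exists in 0.92.1. The class
name and both limits are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of size-bounded batching.
class SizeBoundedBatcher {
    /**
     * Splits rows into batches. A batch is closed as soon as it holds
     * maxRows rows or its total size reaches maxBytes, so it can exceed
     * the byte budget by at most the size of its last row.
     */
    static List<List<byte[]>> batch(List<byte[]> rows, int maxRows,
                                    long maxBytes) {
        List<List<byte[]>> batches = new ArrayList<>();
        List<byte[]> current = new ArrayList<>();
        long currentBytes = 0;
        for (byte[] row : rows) {
            current.add(row);
            currentBytes += row.length;
            if (current.size() >= maxRows || currentBytes >= maxBytes) {
                batches.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
        }
        if (!current.isEmpty()) {
            batches.add(current); // leftover rows form the final batch
        }
        return batches;
    }
}
```

With a large row limit such as 1000, the byte budget is what ends a batch
when rows are big, while small rows still fill batches up to the row limit.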

Thanks in advance for your answers.
I'm ready to provide any additional information you may need to help me
with this issue.

Best Regards
Ivan Tretyakov
