hbase-user mailing list archives

From Ivan Tretyakov <itretya...@griddynamics.com>
Subject Re: Hbase region servers shuts down unexpectedly
Date Mon, 18 Nov 2013 13:10:48 GMT
Thank you for the answer, Ted.

We were able to fix the issue by tuning the
hbase.client.scanner.max.result.size parameter.
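
A minimal sketch of what tuning that property can look like from a Java client, assuming the 0.92-era API; the 64 MB value is illustrative only, and depending on the HBase version the property may need to be set in hbase-site.xml on the region servers rather than on the client:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  public class ScannerResultSizeSketch {
    public static void main(String[] args) {
      Configuration conf = HBaseConfiguration.create();
      // Cap how much data a single scanner next() RPC may return.
      // 64 MB is an example value, not a recommendation; on 0.92-era clusters
      // this property is read by the region servers, so it normally lives in
      // hbase-site.xml there.
      conf.setLong("hbase.client.scanner.max.result.size", 64L * 1024 * 1024);
      System.out.println(conf.getLong("hbase.client.scanner.max.result.size", -1));
    }
  }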

P.S. "The HBase development team has affectionately dubbed this scenario a
Juliet Pause — the master (Romeo) presumes the region server (Juliet) is
dead when it’s really just sleeping, and thus takes some drastic action
(recovery). When the server wakes up, it sees that a great mistake has been
made and takes its own life. Makes for a good play, but a pretty awful
failure scenario!"


On Fri, Nov 8, 2013 at 10:26 PM, Ted Yu <yuzhihong@gmail.com> wrote:

> Have you tried using setBatch() to limit the number of columns returned?
>
> See code example in 9.4.4.3. of
> http://hbase.apache.org/book.html#client.filter.kvm
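
A minimal sketch of the setBatch() approach suggested above, assuming the 0.92-era client API; the table name and the batch value of 100 are illustrative only:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;

  public class BatchedScanSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "mytable");   // table name is hypothetical
      try {
        Scan scan = new Scan();
        // Return at most 100 columns per Result so a very wide row is split
        // across several calls instead of arriving as one huge response.
        scan.setBatch(100);
        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result r : scanner) {
            // process r ...
          }
        } finally {
          scanner.close();
        }
      } finally {
        table.close();
      }
    }
  }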
>
>
> On Fri, Nov 8, 2013 at 10:18 AM, Ivan Tretyakov <itretyakov@griddynamics.com> wrote:
>
> > Hello!
> >
> > We have the following issue on our cluster running HBase 0.92.1-cdh4.1.1.
> > When we start a full scan of the table, some of the servers shut down
> > unexpectedly with the following lines in the log:
> >
> > 2013-11-07 21:19:12,173 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooLarge): {"processingtimems":6723,"call":"next(-3171672497308828151, 1000), rpc version=1, client version=29, methodsFingerPrint=1891768260","client":"10.0.241.99:43063","starttimems":1383859145449,"queuetimems":0,"class":"HRegionServer","responsesize":1059073884,"method":"next"}
> > 2013-11-07 21:19:33,009 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 20545ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > 2013-11-07 21:19:41,651 INFO org.apache.hadoop.hbase.util.VersionInfo: HBase 0.92.1-cdh4.1.1
> >
> > or one more example:
> >
> > 2013-11-07 22:07:02,587 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooLarge): {"processingtimems":12540,"call":"next(8031108008798991209, 1000), rpc version=1, client version=29, methodsFingerPrint=1891768260","client":"10.0.240.211:33538","starttimems":1383862010045,"queuetimems":14955,"class":"HRegionServer","responsesize":1322737704,"method":"next"}
> > 2013-11-07 22:08:00,413 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-1892992341-10.10.122.111-1352825964285:blk_-2134516062062022634_68425527
> > java.io.EOFException: Premature EOF: no length prefix available
> >         at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
> >         at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:114)
> >         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:670)
> > 2013-11-07 22:08:09,394 INFO org.apache.hadoop.hbase.util.VersionInfo: HBase 0.92.1-cdh4.1.1
> >
> > The last line 'HBase 0.92.1-cdh4.1.1' indicates that a new region server
> > instance has just started. Every time, I see a 'responseTooLarge' message
> > before the shutdown.
> > The job is running with the '-caching' option set to 1000.
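
The 'next(..., 1000)' calls in the log above are scanner next() RPCs asking for 1000 rows at a time, which is where the caching value shows up. A minimal sketch of how scanner caching is set from a Java client; the values are illustrative only:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Scan;

  public class ScannerCachingSketch {
    public static void main(String[] args) {
      Configuration conf = HBaseConfiguration.create();
      // Default number of rows fetched per scanner next() RPC for this client.
      conf.setInt("hbase.client.scanner.caching", 1000);
      // Or per scan, overriding the configured default.
      Scan scan = new Scan();
      scan.setCaching(1000);
      // With 1000 wide rows per RPC, a single response can reach the ~1 GB
      // "responsesize" values seen in the responseTooLarge warnings above.
    }
  }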
> >
> > My current assumption is that the problem is caused by a memory shortage on
> > the RS and a long GC pause, which causes the ZK session to expire and the
> > server to shut down (-Xmx for the RS is 8GB). Then Cloudera Manager restarts
> > it.
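
A mitigation often discussed alongside GC tuning for exactly this failure mode (long pause -> ZooKeeper session expiry -> region server aborts) is giving the session more headroom. A minimal sketch; the 120-second value is illustrative, the ZooKeeper ensemble's own min/max session timeouts must permit it, and in practice the property is normally set in hbase-site.xml rather than in code:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  public class ZkSessionTimeoutSketch {
    public static void main(String[] args) {
      Configuration conf = HBaseConfiguration.create();
      // Allow the region server's ZooKeeper session to survive longer GC pauses
      // before the master declares it dead (value illustrative).
      conf.setInt("zookeeper.session.timeout", 120000); // 120 s, in milliseconds
    }
  }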
> >
> > I've tried running the job with '-caching' equal to 1: there were no
> > restarted servers, but the job didn't finish within a reasonable amount of
> > time. I understand that decreasing the caching value can mitigate the
> > problem, but it doesn't look like the right way to me, because the number
> > of regions per server can increase in the future and we will have a similar
> > problem again. It will also slow down the job.
> >
> > Do you think the problem is caused by the reasons I assume?
> > Is this a known issue?
> > What do you think could be the ways to resolve it?
> > Is there an option to send the response once it becomes too large,
> > regardless of the caching value?
> >
> > Thanks in advance for your answers.
> > I'm ready to provide any additional information you may need to help me
> > with this issue.
> >
> > --
> > Best Regards
> > Ivan Tretyakov
> >
>



-- 
Best Regards
Ivan Tretyakov

Deployment Engineer
Grid Dynamics
+7 812 640 38 76
Skype: ivan.v.tretyakov
www.griddynamics.com
itretyakov@griddynamics.com
