hbase-user mailing list archives

From Pere Kyle <p...@whisper.sh>
Subject Re: Hbase Unusable after auto split to 1024 regions
Date Fri, 07 Nov 2014 00:20:24 GMT
So I have another symptom which is quite odd. When trying to take a snapshot of the table
with no writes coming in (I stopped Thrift), it continually times out when trying to flush
(I don’t believe I have the option of a non-flush snapshot in 0.94). Every single time I get:

ERROR: org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: org.apache.hadoop.hbase.snapshot.HBaseSnapshotException:
Snapshot { ss=backup_weaver table=weaver_events type=FLUSH } had an error.  Procedure backup_weaver
{ waiting=[ip-10-227-42-142.us-west-2.compute.internal,60020,1415302661297, ip-10-227-42-252.us-west-2.compute.internal,60020,1415304752318,
ip-10-231-21-106.us-west-2.compute.internal,60020,1415306503049, ip-10-230-130-102.us-west-2.compute.internal,60020,1415296951057,
ip-10-231-138-119.us-west-2.compute.internal,60020,1415303920176, ip-10-224-53-183.us-west-2.compute.internal,60020,1415311138483,
ip-10-250-1-140.us-west-2.compute.internal,60020,1415311984665, ip-10-227-40-150.us-west-2.compute.internal,60020,1415313275623,
ip-10-231-139-198.us-west-2.compute.internal,60020,1415295324957, ip-10-250-77-76.us-west-2.compute.internal,60020,1415297345932,
ip-10-248-42-35.us-west-2.compute.internal,60020,1415312717768, ip-10-227-45-74.us-west-2.compute.internal,60020,1415296135484,
ip-10-227-43-49.us-west-2.compute.internal,60020,1415303176867, ip-10-230-130-121.us-west-2.compute.internal,60020,1415294726028,
ip-10-224-49-168.us-west-2.compute.internal,60020,1415312488614, ip-10-227-0-82.us-west-2.compute.internal,60020,1415301974178,
ip-10-224-0-167.us-west-2.compute.internal,60020,1415309549108] done=[] }
	at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:362)
	at org.apache.hadoop.hbase.master.HMaster.isSnapshotDone(HMaster.java:2313)
	at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:354)
	at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1434)
Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException via timer-java.util.Timer@239e8159:org.apache.hadoop.hbase.errorhandling.TimeoutException:
Timeout elapsed! Source:Timeout caused Foreign Exception Start:1415319201016, End:1415319261016,
diff:60000, max:60000 ms
	at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85)
	at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:285)
	at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:352)
	... 6 more
Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout
caused Foreign Exception Start:1415319201016, End:1415319261016, diff:60000, max:60000 ms
	at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68)
	at java.util.TimerThread.mainLoop(Timer.java:555)
	at java.util.TimerThread.run(Timer.java:505)


I do not have a single write coming in, so how in the world could this table not be flushed?
I could understand an error the first time or two, but how could it still not be flushed after
a couple of requests? Now I can’t even get the data off the node to a new cluster. Any help
would be greatly appreciated.
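
In the meantime, a minimal sketch of the workaround I plan to try: flush the table manually
from the HBase shell and retry the snapshot, and raise the 60s master-side snapshot timeout.
The property name below (hbase.snapshot.master.timeout.millis) is an assumption on my part
and may be spelled differently in 0.94, so treat this as untested:

  # HBase shell: force a flush of the table's regions, then retry the snapshot
  hbase shell
  flush 'weaver_events'
  snapshot 'weaver_events', 'backup_weaver'

  <!-- hbase-site.xml on the master: raise the snapshot timeout (property name is an assumption) -->
  <property>
    <name>hbase.snapshot.master.timeout.millis</name>
    <value>300000</value>
  </property>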

Thanks,
-Pere



On Nov 6, 2014, at 2:09 PM, Nick Dimiduk <ndimiduk@gmail.com> wrote:

> One other thought: you might try tracing your requests to see where the
> slowness happens. Recent versions of PerformanceEvaluation support this
> feature and can be used directly or as an example for adding tracing to
> your application.
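
A minimal sketch of the kind of tracing run suggested above (the --traceRate option name is an
assumption from memory and may differ, or not exist, in this version of PerformanceEvaluation):

  # single-client random-read workload with a trace started for every request
  hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred --traceRate=1 randomRead 1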
> 
> On Thursday, November 6, 2014, Pere Kyle <pere@whisper.sh> wrote:
> 
>> Bryan,
>> 
>> Thanks again for the incredibly useful reply.
>> 
>> I have confirmed that the callQueueLen is in fact 0, with a max value of 2
>> in the last week (in Ganglia).
>> 
>> hbase.hstore.compaction.max was set to 15 on the nodes, up from the previous 7.
>> 
>> Freezes (laggy responses) on the cluster are frequent and affect both
>> reads and writes. I noticed iowait spikes on the nodes.
>> 
>> The cluster swings between working 100% and serving nothing but timeouts,
>> for no discernible reason.
>> 
>> Looking through the logs, I see tons of responseTooSlow warnings; they are the only
>> regular occurrence in the logs:
>> hbase-hadoop-regionserver-ip-10-230-130-121.us-west-2.compute.internal.log:2014-11-06
>> 03:54:31,640 WARN org.apache.hadoop.ipc.HBaseServer (IPC Server handler 39
>> on 60020): (responseTooSlow):
>> {"processingtimems":14573,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@c67b2ac),
>> rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"
>> 10.231.139.198:57223
>> ","starttimems":1415246057066,"queuetimems":20640,"class":"HRegionServer","responsesize":0,"method":"multi"}
>> hbase-hadoop-regionserver-ip-10-230-130-121.us-west-2.compute.internal.log:2014-11-06
>> 03:54:31,640 WARN org.apache.hadoop.ipc.HBaseServer (IPC Server handler 42
>> on 60020): (responseTooSlow):
>> {"processingtimems":45660,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@6c034090),
>> rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"
>> 10.231.21.106:41126
>> ","starttimems":1415246025979,"queuetimems":202,"class":"HRegionServer","responsesize":0,"method":"multi"}
>> hbase-hadoop-regionserver-ip-10-230-130-121.us-west-2.compute.internal.log:2014-11-06
>> 03:54:31,642 WARN org.apache.hadoop.ipc.HBaseServer (IPC Server handler 46
>> on 60020): (responseTooSlow):
>> {"processingtimems":14620,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@4fc3bb1f),
>> rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"
>> 10.230.130.102:54068
>> ","starttimems":1415246057021,"queuetimems":27565,"class":"HRegionServer","responsesize":0,"method":"multi"}
>> hbase-hadoop-regionserver-ip-10-230-130-121.us-west-2.compute.internal.log:2014-11-06
>> 03:54:31,642 WARN org.apache.hadoop.ipc.HBaseServer (IPC Server handler 35
>> on 60020): (responseTooSlow):
>> {"processingtimems":13431,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@3b321922),
>> rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"
>> 10.227.42.252:60493
>> ","starttimems":1415246058210,"queuetimems":1134,"class":"HRegionServer","responsesize":0,"method":"multi"}
>> On Nov 6, 2014, at 12:38 PM, Bryan Beaudreault <bbeaudreault@hubspot.com>
>> wrote:
>> 
>>> blockingStoreFiles
>> 
>> 

