hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ian Brooks <i.bro...@sensewhere.com>
Subject Re: Snapshot failure
Date Tue, 23 Sep 2014 13:05:19 GMT
Hi,

I'm running snapshots on 7 tables every hour, the last run managed 15 rounds of snapshots
before starting to fail.

I did notice the following while getting the success count

2014-09-10 11:01:55,453 DEBUG [rs(##############,16020,1410126483043)-snapshot-pool151-thread-1]
snapshot.FlushSnapshotSubprocedure: Closing region operation on rawScans,250000000000,1408114528441.ddb14a3d8f90cc18b6dfd4f28df5c534.
2014-09-10 11:01:55,453 DEBUG [rs(##############,16020,1410126483043)-snapshot-pool151-thread-1]
snapshot.FlushSnapshotSubprocedure: Starting region operation on rawScans,960000000000,1408114528443.0b3010a2712ff1b6acec0705f5d171d9.
2014-09-10 11:01:55,453 DEBUG [rs(##############,16020,1410126483043)-snapshot-pool151-thread-1]
snapshot.FlushSnapshotSubprocedure: Flush Snapshotting region rawScans,960000000000,1408114528443.0b3010a2712ff1b6acec0705f5d171d9.
started...
2014-09-10 11:01:55,454 WARN  [member: '##############,16020,1410126483043' subprocedure-pool1-thread-1]
snapshot.RegionServerSnapshotManager: Rethrowing ForeignException from SnapshotSubprocedurePool
java.util.concurrent.ExecutionException: org.apache.hadoop.hbase.errorhandling.TimeoutException
via timer-java.util.Timer@45c34a8:org.apache.hadoop.hbase.errorhandling.TimeoutException:
Timeout elapsed! Source:Timeout caused Foreign Exception Start:1410343254867, End:1410343314867,
diff:60000, max:60000 ms
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:188)
        at org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager$SnapshotSubprocedurePool.waitForOutstandingTasks(RegionServerSnapshotManager.java:295)
        at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.flushSnapshot(FlushSnapshotSubprocedure.java:118)
        at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.insideBarrier(FlushSnapshotSubprocedure.java:137)
        at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:181)
        at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException via timer-java.util.Timer@45c34a8:org.apache.hadoop.hbase.errorhandling.TimeoutException:
Timeout elapsed! Source:Timeout caused Foreign Exception Start:1410343254867, End:1410343314867,
diff:60000, max:60000 ms
        at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:83)
        at org.apache.hadoop.hbase.regionserver.HRegion.addRegionToSnapshot(HRegion.java:2816)
        at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure$RegionSnapshotTask.call(FlushSnapshotSubprocedure.java:83)
        at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure$RegionSnapshotTask.call(FlushSnapshotSubprocedure.java:65)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        ... 4 more
Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout
caused Foreign Exception Start:1410343254867, End:1410343314867, diff:60000, max:60000 ms
        at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:67)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)

Is there a configuration option that I need to increase to allow the snapshots to run for
more than 60 seconds on a given region? That said, I'm guessing the above shouldn't cause
the SubProcedure pool to fill up completely.

-Ian

On Tuesday 23 September 2014 04:33:39 Ted Yu wrote:
> Subprocedure pool is full. 
> 
> How many snapshot requests did you submit before seeing the following ?
> 
> Cheers
> 
> On Sep 23, 2014, at 2:28 AM, Ian Brooks <i.brooks@sensewhere.com> wrote:
> 
> > Hi,
> > 
> > I'm seeing an issue on our hbase cluster which is preventing snapshots from working.
So far the only way i can get it working again is to restart all the regionservers which is
not ideal.
> > 
> > I'm running hbase 0.98.3 on hadoop 2.4.0, the stacktraces are
> > 
> > From host given the snapshot command via hbase shell 
> > 
> > 2014-09-23 10:00:16,116 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs:
Received procedure start children changed event: /hbase/online-snapshot/acquired
> > 2014-09-23 10:00:16,116 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs:
Looking for new procedures under znode:'/hbase/online-snapshot/acquired'
> > 2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs:
Found procedure znode: /hbase/online-snapshot/acquired/offlineScans-20140923-100009
> > 2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs:
start proc data length is 59
> > 2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs:
Found data for znode:/hbase/online-snapshot/acquired/offlineScans-20140923-100009
> > 2014-09-23 10:00:16,118 DEBUG [regionserver16020-EventThread] snapshot.RegionServerSnapshotManager:
Launching subprocedure for snapshot offlineScans-20140923-100009 from table offlineScans
> > 2014-09-23 10:00:16,118 DEBUG [regionserver16020-EventThread] procedure.ProcedureMember:
Submitting new Subprocedure:offlineScans-20140923-100009
> > 2014-09-23 10:00:16,120 DEBUG [member: '############,16020,1410126483043' subprocedure-pool1-thread-314]
procedure.Subprocedure: Starting subprocedure 'offlineScans-20140923-100009' with timeout
60000ms
> > 2014-09-23 10:00:16,120 DEBUG [member: '############,16020,1410126483043' subprocedure-pool1-thread-314]
errorhandling.TimeoutExceptionInjector: Scheduling process timer to run in: 60000 ms
> > 2014-09-23 10:00:16,121 DEBUG [member: '############,16020,1410126483043' subprocedure-pool1-thread-314]
procedure.Subprocedure: Subprocedure 'offlineScans-20140923-100009' starting 'acquire' stage
> > 2014-09-23 10:00:16,121 DEBUG [member: '############,16020,1410126483043' subprocedure-pool1-thread-314]
procedure.Subprocedure: Subprocedure 'offlineScans-20140923-100009' locally acquired
> > 2014-09-23 10:00:16,121 DEBUG [member: '############,16020,1410126483043' subprocedure-pool1-thread-314]
procedure.ZKProcedureMemberRpcs: Member: '##########,16020,1410126483043' joining acquired
barrier for procedure (offlineScans-20140923-100009) in zk
> > 2014-09-23 10:00:16,123 DEBUG [member: '############,16020,1410126483043' subprocedure-pool1-thread-314]
procedure.ZKProcedureMemberRpcs: Watch for global barrier reached:/hbase/online-snapshot/reached/offlineScans-20140923-100009
> > 2014-09-23 10:00:16,124 DEBUG [member: '############,16020,1410126483043' subprocedure-pool1-thread-314]
procedure.Subprocedure: Subprocedure 'offlineScans-20140923-100009' coordinator notified of
'acquire', waiting on 'reached' or 'abort' from coordinator
> > 2014-09-23 10:00:16,133 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs:
Received created event:/hbase/online-snapshot/abort/offlineScans-20140923-100009
> > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs:
Aborting procedure member for znode /hbase/online-snapshot/abort/offlineScans-20140923-100009
> > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread] procedure.ProcedureMember:
Request received to abort procedure offlineScans-20140923-100009
> > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via ###############,16020,1408633802567:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:

> >        at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
> >        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
> >        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
> >        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327)
> >        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> >        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> > Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:

> >        at org.apache.hadoop.hbase.procedure.Subprocedure.cancel(Subprocedure.java:270)
> >        at org.apache.hadoop.hbase.procedure.ProcedureMember.submitSubprocedure(ProcedureMember.java:171)
> >        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.startNewSubprocedure(ZKProcedureMemberRpcs.java:214)
> >        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.waitForNewProcedures(ZKProcedureMemberRpcs.java:172)
> >        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$100(ZKProcedureMemberRpcs.java:55)
> >        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:107)
> >        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:348)
> >        ... 2 more
> > 
> > 
> > 
> > From the regionserver it was trying to contact
> > 
> > 
> > 2014-09-23 10:00:16,124 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs:
Received procedure start children changed event: /hbase/online-snapshot/acquired
> > 2014-09-23 10:00:16,124 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs:
Looking for new procedures under znode:'/hbase/online-snapshot/acquired'
> > 2014-09-23 10:00:16,125 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs:
Found procedure znode: /hbase/online-snapshot/acquired/offlineScans-20140923-100009
> > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs:
start proc data length is 59
> > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs:
Found data for znode:/hbase/online-snapshot/acquired/offlineScans-20140923-100009
> > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread] snapshot.RegionServerSnapshotManager:
Launching subprocedure for snapshot offlineScans-20140923-100009 from table offlineScans
> > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread] procedure.ProcedureMember:
Submitting new Subprocedure:offlineScans-20140923-100009
> > 2014-09-23 10:00:16,128 ERROR [regionserver16020-EventThread] procedure.Subprocedure:
Subprocedure pool is full!
> > 2014-09-23 10:00:16,128 DEBUG [regionserver16020-EventThread] errorhandling.ForeignExceptionDispatcher:
offlineScans-20140923-100009 accepting received exception
> > 2014-09-23 10:00:16,128 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs:
Aborting procedure (offlineScans-20140923-100009) in zk
> > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs:
Finished creating abort znode:/hbase/online-snapshot/abort/offlineScans-20140923-100009
> > 2014-09-23 10:00:16,134 ERROR [regionserver16020-EventThread] procedure.ProcedureMember:
Failed to start subprocedure 'offlineScans-20140923-100009'
> > 2014-09-23 10:00:16,134 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs:
Received created event:/hbase/online-snapshot/abort/offlineScans-20140923-100009
> > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs:
Aborting procedure member for znode /hbase/online-snapshot/abort/offlineScans-20140923-100009
> > 2014-09-23 10:00:16,145 DEBUG [regionserver16020-EventThread] procedure.ProcedureMember:
Request received to abort procedure offlineScans-20140923-100009
> > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via ###########,16020,1408633802567:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
> >        at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
> >        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
> >        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
> >        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327)
> >        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> >        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> > Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
> >        at org.apache.hadoop.hbase.procedure.Subprocedure.cancel(Subprocedure.java:270)
> >        at org.apache.hadoop.hbase.procedure.ProcedureMember.submitSubprocedure(ProcedureMember.java:171)
> >        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.startNewSubprocedure(ZKProcedureMemberRpcs.java:214)
> >        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.waitForNewProcedures(ZKProcedureMemberRpcs.java:172)
> >        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$100(ZKProcedureMemberRpcs.java:55)
> >        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:107)
> >        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:348)
> >        ... 2 more
> > 
> > Any ideas on how I can resolve this?
> > 
> > -Ian
> > 

Mime
View raw message