hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ian Brooks <i.bro...@sensewhere.com>
Subject Re: Snapshot failure
Date Fri, 17 Oct 2014 13:12:55 GMT
Hi,

I got a change finally to apply the timeout changes and it lasted longer before it started
failing, but its now just throwing the same errors on "procedure.Subprocedure: Subprocedure
pool is full!"

Looking through the logs, it looks like the nodes may not correctly updating they completion
state after finishing a snapshot,

eg.

2014-09-25 21:03:15,713 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs:
Received procedure start children changed event: /hbase/online-snapshot/acquired
2014-09-25 21:03:15,713 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs:
Looking for new procedures under znode:'/hbase/online-snapshot/acquired'
2014-09-25 21:03:15,715 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs:
Found procedure znode: /hbase/online-snapshot/acquired/requestData-20140925-210256
2014-09-25 21:03:15,718 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs:
start proc data length is 57
2014-09-25 21:03:15,718 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs:
Found data for znode:/hbase/online-snapshot/acquired/requestData-20140925-210256
2014-09-25 21:03:15,718 DEBUG [regionserver16020-EventThread] snapshot.RegionServerSnapshotManager:
Launching subprocedure for snapshot requestData-20140925-210256 from table requestData
2014-09-25 21:03:15,718 DEBUG [regionserver16020-EventThread] procedure.ProcedureMember: Submitting
new Subprocedure:requestData-20140925-210256
2014-09-25 21:03:15,719 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1]
procedure.Subprocedure: Starting subprocedure 'requestData-20140925-210256' with timeout 600000ms
2014-09-25 21:03:15,719 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1]
errorhandling.TimeoutExceptionInjector: Scheduling process timer to run in: 600000 ms
2014-09-25 21:03:15,720 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1]
procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' starting 'acquire' stage
2014-09-25 21:03:15,720 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1]
procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' locally acquired
2014-09-25 21:03:15,720 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1]
procedure.ZKProcedureMemberRpcs: Member: '###########-005,16020,1411640680181' joining acquired
barrier fo
r procedure (requestData-20140925-210256) in zk
2014-09-25 21:03:15,724 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1]
procedure.ZKProcedureMemberRpcs: Watch for global barrier reached:/hbase/online-snapshot/reached/request
Data-20140925-210256
2014-09-25 21:03:15,739 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1]
procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' coordinator notified of
'acquire', wa
iting on 'reached' or 'abort' from coordinator
2014-09-25 21:03:15,825 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs:
Received created event:/hbase/online-snapshot/reached/requestData-20140925-210256
2014-09-25 21:03:15,825 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs:
Recieved reached global barrier:/hbase/online-snapshot/reached/requestData-20140925-210256
2014-09-25 21:03:15,825 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1]
procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' received 'reached' from
coordinator.

...

2014-09-25 21:03:38,026 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1]
snapshot.RegionServerSnapshotManager: Completed 5 local region snapshots.
2014-09-25 21:03:38,026 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1]
snapshot.RegionServerSnapshotManager: cancelling 0 tasks for snapshot ###########-005,16020,1411640680181
2014-09-25 21:03:38,026 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1]
procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' locally completed
2014-09-25 21:03:38,026 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1]
procedure.ZKProcedureMemberRpcs: Marking procedure  'requestData-20140925-210256' completed
for member '
###########-005,16020,1411640680181' in zk
2014-09-25 21:03:38,028 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1]
procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' has notified controller
of completion
2014-09-25 21:03:38,028 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1]
errorhandling.TimeoutExceptionInjector: Marking timer as complete - no error notifications
will be recei
ved for this timer.
2014-09-25 21:03:38,028 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1]
procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' completed.
2


Which suggests that "requestData-20140925-210256" completed happily, however, 10 minutes later
the following comes up in the logs


2014-09-25 21:13:15,777 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs:
Received created event:/hbase/online-snapshot/abort/requestData-20140925-210256
2014-09-25 21:13:15,777 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs:
Aborting procedure member for znode /hbase/online-snapshot/abort/requestData-20140925-210256
2014-09-25 21:13:15,778 DEBUG [regionserver16020-EventThread] procedure.ProcedureMember: Request
received to abort procedure requestData-20140925-210256
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@5cb99ea6:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused
Foreign Exception Start:1411675395727, End:1411675995727, diff:600000, max:600000 ms
        at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException:
Timeout elapsed! Source:Timeout caused Foreign Exception Start:1411675395727, End:1411675995727,
diff:600000, max:600000 ms
        at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:70)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)
2014-09-25 21:13:15,780 INFO  [regionserver16020-EventThread] procedure.ProcedureMember: Received
abort on procedure with no local subprocedure requestData-20140925-210256, ignoring it.
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@5cb99ea6:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused
Foreign Exception Start:1411675395727, End:1411675995727, diff:600000, max:600000 ms
        at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException:
Timeout elapsed! Source:Timeout caused Foreign Exception Start:1411675395727, End:1411675995727,
diff:600000, max:600000 ms
        at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:70)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)





-Ian Brooks


On Tuesday 23 September 2014 06:52:34 Ted Yu wrote:
> Here're the config parameters related to controlling snapshot timeout:
> 
>   <property>
>     <name>hbase.snapshot.master.timeoutMillis</name>
>     <!-- Change from default of 60s to 600s to allow for slow flushing of
> tables -->
>     <value>600000</value>
>     <description>
>       This is the time HBase master waits for the snapshot operation to
> complete.
>       Do not confuse this hbase.snapshot.master.timeout.millis, which
> although
>       sounding similar, serves a very different purpose.
>       Note: This property has a completely different meaning before hbase
> version
>       0.94.11 and should not enabled on a cluster using snapshots and
> running
>       a version before 0.94.11.
>     </description>
>   </property>
>   <property>
>     <name>hbase.snapshot.master.timeout.millis</name>
>     <!-- Change from default of 60s to 600s to allow for slow flushing of
> tables -->
>     <value>600000</value>
>     <description>
>       This is the timeout the master indicates the client to wait when it
> takes
>       the snapshot. The client actually waits longer than this due to
> exponential
>       backoff. See HBaseAdmin.snapshot for the exact algorithm.
>     </description>
>   </property>
>   <property>
>     <name>hbase.snapshot.region.timeout</name>
>     <!-- Change from default of 60s to 600s to allow for slow flushing of
> tables -->
>     <value>600000</value>
>     <description>
>       This is the time the regionserver waits to complete all of its
> activities
>       for a snapshot operation.
>     </description>
>   </property>

Mime
View raw message