hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tianying Chang <tych...@gmail.com>
Subject Re: Failed to take snapshot due to some region directory is not found
Date Tue, 19 May 2015 20:45:57 GMT
Matteo


By looking at the DEBUG log at RS side, it seems to me that no online
regions were pickedup. So it seems to me that this call returns 0 regions.
But I am not sure how that happens. Is there anyway to verify this?

 involvedRegions = getRegionsToSnapshot(snapshot);

2015-05-19 20:36:22,223 DEBUG
org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager:
cancelling 0 tasks for snapshot



Thanks
Tian-Ying

Full log:

2015-05-19 20:35:46,684 INFO
org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received procedure
start children changed event: /hbase/online-snapshot/acquired
2015-05-19 20:35:46,686 DEBUG
org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager:
Launching subprocedure for snapshot ss_rich_pin_data_v1 from table
rich_pin_data_v1
2015-05-19 20:36:21,723 INFO
org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received created
event:/hbase/online-snapshot/abort/ss_rich_pin_data_v1
2015-05-19 20:36:21,724 ERROR
org.apache.hadoop.hbase.procedure.ProcedureMember: Propagating foreign
exception to subprocedure ss_rich_pin_data_v1
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via
timer-java.util.Timer@5add830c:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed!
Source:Timeout caused Foreign Exception Start:1432067746683,
End:1432067781681, diff:34998, max:35000 ms
at
org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
at
org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
at
org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:95)
at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:290)
at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
Caused by:
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed!
Source:Timeout caused Foreign Exception Start:1432067746683,
End:1432067781681, diff:34998, max:35000 ms
at
org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:71)
at java.util.TimerThread.mainLoop(Timer.java:512)
at java.util.TimerThread.run(Timer.java:462)
2015-05-19 20:36:21,724 INFO
org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received procedure
abort children changed event: /hbase/online-snapshot/abort
2015-05-19 20:36:21,726 ERROR
org.apache.hadoop.hbase.procedure.ProcedureMember: Propagating foreign
exception to subprocedure ss_rich_pin_data_v1
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via
timer-java.util.Timer@5add830c:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed!
Source:Timeout caused Foreign Exception Start:1432067746683,
End:1432067781681, diff:34998, max:35000 ms
at
org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
at
org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
at
org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.watchForAbortedProcedures(ZKProcedureMemberRpcs.java:143)
at
org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$200(ZKProcedureMemberRpcs.java:56)
at
org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:111)
at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:311)
at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
Caused by:
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed!
Source:Timeout caused Foreign Exception Start:1432067746683,
End:1432067781681, diff:34998, max:35000 ms
at
org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:71)
at java.util.TimerThread.mainLoop(Timer.java:512)
at java.util.TimerThread.run(Timer.java:462)
2015-05-19 20:36:21,780 INFO
org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received procedure
start children changed event: /hbase/online-snapshot/acquired
2015-05-19 20:36:21,784 INFO
org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received procedure
abort children changed event: /hbase/online-snapshot/abort
2015-05-19 20:36:22,221 ERROR
org.apache.hadoop.hbase.procedure.Subprocedure: Subprocedure
'ss_rich_pin_data_v1' aborting due to a ForeignException!
org.apache.hadoop.hbase.errorhandling.TimeoutException via
timer-java.util.Timer@25ff6d2a:org.apache.hadoop.hbase.errorhandling.TimeoutException:
Timeout elapsed! Source:Timeout caused Foreign Exception
Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms
at
org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85)
at
org.apache.hadoop.hbase.procedure.Procedure.waitForLatch(Procedure.java:369)
at
org.apache.hadoop.hbase.procedure.Subprocedure.waitForReachedGlobalBarrier(Subprocedure.java:296)
at
org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:170)
at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout
elapsed! Source:Timeout caused Foreign Exception Start:1432067746687,
End:1432067781721, diff:35034, max:35000 ms
at
org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68)
at java.util.TimerThread.mainLoop(Timer.java:512)
at java.util.TimerThread.run(Timer.java:462)
2015-05-19 20:36:22,223 ERROR
org.apache.hadoop.hbase.procedure.Subprocedure: Subprocedure
'ss_rich_pin_data_v1' aborting due to a ForeignException!
org.apache.hadoop.hbase.errorhandling.TimeoutException via
timer-java.util.Timer@25ff6d2a:org.apache.hadoop.hbase.errorhandling.TimeoutException:
Timeout elapsed! Source:Timeout caused Foreign Exception
Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms
at
org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85)
at
org.apache.hadoop.hbase.procedure.Procedure.waitForLatch(Procedure.java:369)
at
org.apache.hadoop.hbase.procedure.Subprocedure.waitForReachedGlobalBarrier(Subprocedure.java:296)
at
org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:170)
at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout
elapsed! Source:Timeout caused Foreign Exception Start:1432067746687,
End:1432067781721, diff:35034, max:35000 ms
at
org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68)
at java.util.TimerThread.mainLoop(Timer.java:512)
at java.util.TimerThread.run(Timer.java:462)
2015-05-19 20:36:22,223 INFO
org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure:
Aborting all online FLUSH snapshot subprocedure task threads for
'ss_rich_pin_data_v1' due to error
org.apache.hadoop.hbase.errorhandling.TimeoutException via
timer-java.util.Timer@25ff6d2a:org.apache.hadoop.hbase.errorhandling.TimeoutException:
Timeout elapsed! Source:Timeout caused Foreign Exception
Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms
at
org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85)
at
org.apache.hadoop.hbase.procedure.Procedure.waitForLatch(Procedure.java:369)
at
org.apache.hadoop.hbase.procedure.Subprocedure.waitForReachedGlobalBarrier(Subprocedure.java:296)
at
org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:170)
at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout
elapsed! Source:Timeout caused Foreign Exception Start:1432067746687,
End:1432067781721, diff:35034, max:35000 ms
at
org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68)
at java.util.TimerThread.mainLoop(Timer.java:512)
at java.util.TimerThread.run(Timer.java:462)
2015-05-19 20:36:22,223 DEBUG
org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager:
cancelling 0 tasks for snapshot

On Tue, May 19, 2015 at 1:30 PM, Tianying Chang <tychang@gmail.com> wrote:

> Matteo
>
> We are using hdfs2.0 + HBase94.7.
>
> I saw this ArrayIndexOutOfBoundsException: 2 error also. What does that
> mean?
>
> BTW, other tables (but those are smaller in terms of region count) in this
> same cluster is able to create snapshot, only this table is failing.
>
> Thanks
> Tian-Ying
>
> On Tue, May 19, 2015 at 11:50 AM, Matteo Bertozzi <theo.bertozzi@gmail.com
> > wrote:
>
>> can you debug the protobuf problem, I think we abort because we are not
>> able to write
>>
>> 2015-05-19 06:00:49,745 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
>> handler 50 on 60000 caught: java.lang.ArrayIndexOutOfBoundsException: 2
>>         at java.util.Arrays$ArrayList.get(Arrays.java:3381)
>>         at
>> java.util.Collections$UnmodifiableList.get(Collections.java:1152)
>>         at
>>
>> org.apache.hadoop.hbase.protobuf.generated.HBaseProtos$SnapshotDescription$Type.getValueDescriptor(HBaseProtos.java:99)
>> ...
>> com.google.protobuf.AbstractMessage.toString(AbstractMessage.java:86)
>>         at
>>
>> org.apache.hadoop.hbase.snapshot.HSnapshotDescription.toString(HSnapshotDescription.java:72)
>>         at java.lang.String.valueOf(String.java:2826)
>>         at java.lang.StringBuilder.append(StringBuilder.java:115)
>>         at
>> org.apache.hadoop.hbase.ipc.Invocation.toString(Invocation.java:152)
>>         at
>>
>> org.apache.hadoop.hbase.ipc.HBaseServer$Call.toString(HBaseServer.java:304)
>>
>> Matteo
>>
>>
>> On Tue, May 19, 2015 at 11:35 AM, Tianying Chang <tychang@gmail.com>
>> wrote:
>>
>> > Actually, I find it does not even print out the debug info below for
>> this
>> > table, other table will print out this logging. So it seems it did not
>> > invoke the FlushSnapshotSubprocedure at all.
>> >
>> >
>> >  @Override
>> >     public Void call() throws Exception {
>> >       // Taking the region read lock prevents the individual region from
>> > being closed while a
>> >       // snapshot is in progress.  This is helpful but not sufficient
>> for
>> > preventing races with
>> >       // snapshots that involve multiple regions and regionservers.  It
>> is
>> > still possible to have
>> >       // an interleaving such that globally regions are missing, so we
>> > still need the verification
>> >       // step.
>> >       LOG.debug("Starting region operation on " + region);
>> >
>> > On Tue, May 19, 2015 at 11:26 AM, Tianying Chang <tychang@gmail.com>
>> > wrote:
>> >
>> > > Hi, Esteban,
>> > >
>> > > There is no region split in this cluster, since we put the region size
>> > > upper bound to be really high to prevent splitting.
>> > >
>> > > I think it happens for all the regions of this table.
>> > >
>> > > I repeatedly run "hdfs dfs -lsr
>> > > /hbase/.hbase-snapshot/ss_rich_pin_data_v1"  while taking snapshot, no
>> > > region was able to write into this direction. I also turn on DEBUG
>> > logging
>> > > on RS, all RS  just report fail with Timeout, with no specific reason.
>> > >
>> > > Thanks
>> > > Tian-Ying
>> > >
>> > > On Tue, May 19, 2015 at 11:06 AM, Esteban Gutierrez <
>> > esteban@cloudera.com>
>> > > wrote:
>> > >
>> > >> Hi Tianying,
>> > >>
>> > >> Is this happening consistently in this region or is it happening
>> > randomly
>> > >> across other regions too? One possibility is that there was a split
>> > going
>> > >> on at the time you started to take the snapshot and it failed. If you
>> > look
>> > >> into /hbase/rich_pin_data_v1 can you find a directory named
>> > >> dff681880bb2b23d0351d6656a1dbbb9 in there?
>> > >>
>> > >> cheers,
>> > >> esteban.
>> > >>
>> > >>
>> > >> --
>> > >> Cloudera, Inc.
>> > >>
>> > >>
>> > >> On Mon, May 18, 2015 at 11:12 PM, Tianying Chang <tychang@gmail.com>
>> > >> wrote:
>> > >>
>> > >> > Hi,
>> > >> >
>> > >> > We have a cluster that used to be able to take snapshot. But
>> recently,
>> > >> one
>> > >> > table failed due to the error below. Other tables on the same
>> clusters
>> > >> are
>> > >> > fine.
>> > >> >
>> > >> > Any idea what could go wrong? Is the table not healthy? But I
run
>> > hbase
>> > >> > hbck, it reports cluster healthy.
>> > >> >
>> > >> > BTW, we are running 94.7, so we need to take snapshot of the data
>> to
>> > >> export
>> > >> > to a new cluster of 94.26 as part of upgrade (and eventually
>> upgrade
>> > to
>> > >> > 1.x)
>> > >> >
>> > >> > Thanks
>> > >> > Tian-Ying
>> > >> >
>> > >> >
>> > >> > 015-05-19 06:00:45,505 ERROR
>> > >> > org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler: Failed
>> > >> taking
>> > >> > snapshot { ss=ss_rich_pin_data_v1 table=rich_pin_data_v1
>> > type=SKIPFLUSH
>> > >> }
>> > >> > due to exception:No region directory found for region:{NAME =>
>> > >> >
>> 'rich_pin_data_v1,,1389319134976.dff681880bb2b23d0351d6656a1dbbb9.',
>> > >> > STARTKEY => '', ENDKEY => '001ff3a165ff571471603035ca7b4be9',
>> ENCODED
>> > =>
>> > >> > dff681880bb2b23d0351d6656a1dbbb9,}
>> > >> > org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: No
>> region
>> > >> > directory found for region:{NAME =>
>> > >> >
>> 'rich_pin_data_v1,,1389319134976.dff681880bb2b23d0351d6656a1dbbb9.',
>> > >> > STARTKEY => '', ENDKEY => '001ff3a165ff571471603035ca7b4be9',
>> ENCODED
>> > =>
>> > >> > dff681880bb2b23d0351d6656a1dbbb9,}
>> > >> >         at
>> > >> >
>> > >> >
>> > >>
>> >
>> org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifyRegion(MasterSnapshotVerifier.java:167)
>> > >> >         at
>> > >> >
>> > >> >
>> > >>
>> >
>> org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifyRegions(MasterSnapshotVerifier.java:152)
>> > >> >         at
>> > >> >
>> > >> >
>> > >>
>> >
>> org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifySnapshot(MasterSnapshotVerifier.java:115)
>> > >> >         at
>> > >> >
>> > >> >
>> > >>
>> >
>> org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.process(TakeSnapshotHandler.java:156)
>> > >> >         at
>> > >> >
>> > org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
>> > >> >         at
>> > >> >
>> > >> >
>> > >>
>> >
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>> > >> >         at
>> > >> >
>> > >> >
>> > >>
>> >
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>> > >> >         at java.lang.Thread.run(Thread.java:662)
>> > >> > 2015-05-19 06:00:45,505 INFO
>> > >> > org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler: Stop
>> > taking
>> > >> > snapshot={ ss=ss_rich_pin_data_v1 table=rich_pin_data_v1
>> > type=SKIPFLUSH
>> > >> }
>> > >> > because: Failed to take snapshot '{ ss=ss_rich_pin_data_v1
>> > >> > table=rich_pin_data_v1 type=SKIPFLUSH }' due to exception
>> > >> > 2015-05-19 06:00:49,745 WARN org.apache.hadoop.ipc.HBaseServer:
IPC
>> > >> Server
>> > >> > handler 50 on 60000 caught:
>> java.lang.ArrayIndexOutOfBoundsException:
>> > 2
>> > >> >         at java.util.Arrays$ArrayList.get(Arrays.java:3381)
>> > >> >         at
>> > >> > java.util.Collections$UnmodifiableList.get(Collections.java:1152)
>> > >> >         at
>> > >> >
>> > >> >
>> > >>
>> >
>> org.apache.hadoop.hbase.protobuf.generated.HBaseProtos$SnapshotDescription$Type.getValueDescriptor(HBaseProtos.java:99)
>> > >> >         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
>> Method)
>> > >> >         at
>> > >> >
>> > >> >
>> > >>
>> >
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>> > >> >         at
>> > >> >
>> > >> >
>> > >>
>> >
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> > >> >         at java.lang.reflect.Method.invoke(Method.java:597)
>> > >> >         at
>> > >> >
>> > >> >
>> > >>
>> >
>> com.google.protobuf.GeneratedMessage.invokeOrDie(GeneratedMessage.java:1369)
>> > >> >         at
>> > >> >
>> > >>
>> >
>> com.google.protobuf.GeneratedMessage.access$1400(GeneratedMessage.java:57)
>> > >> >         at
>> > >> >
>> > >> >
>> > >>
>> >
>> com.google.protobuf.GeneratedMessage$FieldAccessorTable$SingularEnumFieldAccessor.get(GeneratedMessage.java:1670)
>> > >> >         at
>> > >> >
>> > com.google.protobuf.GeneratedMessage.getField(GeneratedMessage.java:162)
>> > >> >         at
>> > >> >
>> > >> >
>> > >>
>> >
>> com.google.protobuf.GeneratedMessage.getAllFieldsMutable(GeneratedMessage.java:113)
>> > >> >         at
>> > >> >
>> > >> >
>> > >>
>> >
>> com.google.protobuf.GeneratedMessage.getAllFields(GeneratedMessage.java:152)
>> > >> >         at
>> > >> > com.google.protobuf.TextFormat$Printer.print(TextFormat.java:228)
>> > >> >         at
>> > >> >
>> com.google.protobuf.TextFormat$Printer.access$200(TextFormat.java:217)
>> > >> >         at com.google.protobuf.TextFormat.print(TextFormat.java:68)
>> > >> >         at
>> > >> > com.google.protobuf.TextFormat.printToString(TextFormat.java:115)
>> > >> >         at
>> > >> >
>> com.google.protobuf.AbstractMessage.toString(AbstractMessage.java:86)
>> > >> >         at
>> > >> >
>> > >> >
>> > >>
>> >
>> org.apache.hadoop.hbase.snapshot.HSnapshotDescription.toString(HSnapshotDescription.java:72)
>> > >> >         at java.lang.String.valueOf(String.java:2826)
>> > >> >         at java.lang.StringBuilder.append(StringBuilder.java:115)
>> > >> >         at
>> > >> >
>> org.apache.hadoop.hbase.ipc.Invocation.toString(Invocation.java:152)
>> > >> >         at
>> > >> >
>> > >>
>> >
>> org.apache.hadoop.hbase.ipc.HBaseServer$Call.toString(HBaseServer.java:304)
>> > >> >         at java.lang.String.valueOf(String.java:2826)
>> > >> >         at java.lang.StringBuilder.append(StringBuilder.java:115)
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message