hbase-user mailing list archives

From Tianying Chang <tych...@gmail.com>
Subject Re: Failed to take snapshot due to some region directory is not found
Date Wed, 20 May 2015 16:26:49 GMT
Just wanted to report an update back.

I fixed the problem by performing a rolling restart. With DEBUG logging
enabled on the RS, I can now see the messages pasted below: it is doing a
SKIP_FLUSH snapshot. Before the rolling restart, none of these messages were
printed when taking a snapshot of this table (other tables did print them).
One thing to mention: I have to use SKIP_FLUSH, otherwise the snapshot still
fails, probably because of the particular traffic on this cluster. I am not
sure it is just heavy writes, since we have other clusters that take an even
higher write load and do not have this problem. It seems to me that the RSs
were in some weird bad state where they wouldn't take snapshots; not sure
where the bug is, though.
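
For reference, this is roughly how the skip-flush snapshot is triggered from
the HBase shell (the SKIP_FLUSH option spelling below is from newer shell
versions and is my assumption for ours; either way the request shows up as
type=SKIPFLUSH in the logs):

  hbase> snapshot 'rich_pin_data_v1', 'ss_rich_pin_data_v1', {SKIP_FLUSH => true}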



2015-05-20 16:11:34,230 DEBUG org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure: take snapshot without flush memstore first
2015-05-20 16:11:34,230 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Storing region-info for snapshot.
2015-05-20 16:11:34,255 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Creating references for hfiles
2015-05-20 16:11:34,255 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Adding snapshot references for
[hdfs://ec2-54-243-40-6.compute-1.amazonaws.com/hbase/rich_pin_data_v1/c578f8a0b71e55a14c652ba16aa54b45/d/a60f3b6680ab48728bff61be3d1b419c,
 hdfs://ec2-54-243-40-6.compute-1.amazonaws.com/hbase/rich_pin_data_v1/c578f8a0b71e55a14c652ba16aa54b45/d/771b8125162f4866a002693152704280,
 hdfs://ec2-54-243-40-6.compute-1.amazonaws.com/hbase/rich_pin_data_v1/c578f8a0b71e55a14c652ba16aa54b45/d/633f76def221463991611b0b9d007bbd,
 hdfs://ec2-54-243-40-6.compute-1.amazonaws.com/hbase/rich_pin_data_v1/c578f8a0b71e55a14c652ba16aa54b45/d/82b91072d82c4f0a97960dd9bd65b814,
 hdfs://ec2-54-243-40-6.compute-1.amazonaws.com/hbase/rich_pin_data_v1/c578f8a0b71e55a14c652ba16aa54b45/d/76dac71663ac40919d227f039307cd6d] hfiles
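
As a sanity check after a snapshot completes, the stock SnapshotInfo tool can
dump the files it references (invocation from memory; flags may vary by
version):

  hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo -snapshot ss_rich_pin_data_v1 -files -stats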


Thanks
Tian-Ying


On Tue, May 19, 2015 at 2:26 PM, Esteban Gutierrez <esteban@cloudera.com>
wrote:

> http://pastebin.com or http://gist.github.com work fine.
>
> thanks,
> esteban.
>
>
> --
> Cloudera, Inc.
>
>
> On Tue, May 19, 2015 at 2:23 PM, Tianying Chang <tychang@gmail.com> wrote:
>
> > Sure, Esteban. Where is a good place to upload the log?
> >
> > On Tue, May 19, 2015 at 2:01 PM, Esteban Gutierrez
> > <esteban@cloudera.com> wrote:
> >
> > > The latest log is very interesting, Tianying, but I don't see how it is
> > > related to the initial CorruptedSnapshotException, since
> > > ZKProcedureMemberRpcs is aborting the operation due to a timeout. Would
> > > it be possible for you to upload the HBase master and region server
> > > logs to pastebin or some other site? Just around the time you started
> > > the snapshot and it failed should be fine.
> > >
> > > thanks,
> > > esteban.
> > >
> > >
> > > --
> > > Cloudera, Inc.
> > >
> > >
> > > On Tue, May 19, 2015 at 1:45 PM, Tianying Chang <tychang@gmail.com>
> > > wrote:
> > >
> > > > Matteo,
> > > >
> > > > By looking at the DEBUG log on the RS side, it seems to me that no
> > > > online regions were picked up, so this call must have returned 0
> > > > regions. But I am not sure how that happens. Is there any way to
> > > > verify this?
> > > >
> > > >   involvedRegions = getRegionsToSnapshot(snapshot);
> > > >
> > > > 2015-05-19 20:36:22,223 DEBUG org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager: cancelling 0 tasks for snapshot
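> > > >
> > > > The closest check I could come up with is to list, from a client,
> > > > which server each region of the table is assigned to, and compare
> > > > that against the RS that logs "cancelling 0 tasks". A rough, untested
> > > > sketch against the 0.94 client API (method names from memory):
> > > >
> > > >   import java.util.Map;
> > > >   import org.apache.hadoop.conf.Configuration;
> > > >   import org.apache.hadoop.hbase.HBaseConfiguration;
> > > >   import org.apache.hadoop.hbase.HRegionInfo;
> > > >   import org.apache.hadoop.hbase.ServerName;
> > > >   import org.apache.hadoop.hbase.client.HTable;
> > > >
> > > >   public class RegionCheck {
> > > >     public static void main(String[] args) throws Exception {
> > > >       Configuration conf = HBaseConfiguration.create();
> > > >       HTable table = new HTable(conf, "rich_pin_data_v1");
> > > >       try {
> > > >         // Region -> RS assignment as seen from meta; an RS that owns
> > > >         // regions here but logs "cancelling 0 tasks" is suspicious.
> > > >         for (Map.Entry<HRegionInfo, ServerName> e :
> > > >             table.getRegionLocations().entrySet()) {
> > > >           System.out.println(e.getKey().getEncodedName()
> > > >               + " -> " + e.getValue());
> > > >         }
> > > >       } finally {
> > > >         table.close();
> > > >       }
> > > >     }
> > > >   }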
> > > >
> > > >
> > > >
> > > > Thanks
> > > > Tian-Ying
> > > >
> > > > Full log:
> > > >
> > > > 2015-05-19 20:35:46,684 INFO org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/online-snapshot/acquired
> > > > 2015-05-19 20:35:46,686 DEBUG org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager: Launching subprocedure for snapshot ss_rich_pin_data_v1 from table rich_pin_data_v1
> > > > 2015-05-19 20:36:21,723 INFO org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received created event:/hbase/online-snapshot/abort/ss_rich_pin_data_v1
> > > > 2015-05-19 20:36:21,724 ERROR org.apache.hadoop.hbase.procedure.ProcedureMember: Propagating foreign exception to subprocedure ss_rich_pin_data_v1
> > > > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@5add830c:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746683, End:1432067781681, diff:34998, max:35000 ms
> > > >         at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
> > > >         at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
> > > >         at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:95)
> > > >         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:290)
> > > >         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
> > > >         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
> > > > Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746683, End:1432067781681, diff:34998, max:35000 ms
> > > >         at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:71)
> > > >         at java.util.TimerThread.mainLoop(Timer.java:512)
> > > >         at java.util.TimerThread.run(Timer.java:462)
> > > > 2015-05-19 20:36:21,724 INFO org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received procedure abort children changed event: /hbase/online-snapshot/abort
> > > > 2015-05-19 20:36:21,726 ERROR org.apache.hadoop.hbase.procedure.ProcedureMember: Propagating foreign exception to subprocedure ss_rich_pin_data_v1
> > > > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@5add830c:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746683, End:1432067781681, diff:34998, max:35000 ms
> > > >         at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
> > > >         at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
> > > >         at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.watchForAbortedProcedures(ZKProcedureMemberRpcs.java:143)
> > > >         at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$200(ZKProcedureMemberRpcs.java:56)
> > > >         at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:111)
> > > >         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:311)
> > > >         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
> > > >         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
> > > > Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746683, End:1432067781681, diff:34998, max:35000 ms
> > > >         at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:71)
> > > >         at java.util.TimerThread.mainLoop(Timer.java:512)
> > > >         at java.util.TimerThread.run(Timer.java:462)
> > > > 2015-05-19 20:36:21,780 INFO org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/online-snapshot/acquired
> > > > 2015-05-19 20:36:21,784 INFO org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received procedure abort children changed event: /hbase/online-snapshot/abort
> > > > 2015-05-19 20:36:22,221 ERROR org.apache.hadoop.hbase.procedure.Subprocedure: Subprocedure 'ss_rich_pin_data_v1' aborting due to a ForeignException!
> > > > org.apache.hadoop.hbase.errorhandling.TimeoutException via timer-java.util.Timer@25ff6d2a:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms
> > > >         at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85)
> > > >         at org.apache.hadoop.hbase.procedure.Procedure.waitForLatch(Procedure.java:369)
> > > >         at org.apache.hadoop.hbase.procedure.Subprocedure.waitForReachedGlobalBarrier(Subprocedure.java:296)
> > > >         at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:170)
> > > >         at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52)
> > > >         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> > > >         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> > > >         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> > > >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> > > >         at java.lang.Thread.run(Thread.java:662)
> > > > Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms
> > > >         at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68)
> > > >         at java.util.TimerThread.mainLoop(Timer.java:512)
> > > >         at java.util.TimerThread.run(Timer.java:462)
> > > > 2015-05-19 20:36:22,223 ERROR org.apache.hadoop.hbase.procedure.Subprocedure: Subprocedure 'ss_rich_pin_data_v1' aborting due to a ForeignException!
> > > > org.apache.hadoop.hbase.errorhandling.TimeoutException via timer-java.util.Timer@25ff6d2a:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms
> > > >         at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85)
> > > >         at org.apache.hadoop.hbase.procedure.Procedure.waitForLatch(Procedure.java:369)
> > > >         at org.apache.hadoop.hbase.procedure.Subprocedure.waitForReachedGlobalBarrier(Subprocedure.java:296)
> > > >         at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:170)
> > > >         at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52)
> > > >         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> > > >         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> > > >         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> > > >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> > > >         at java.lang.Thread.run(Thread.java:662)
> > > > Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms
> > > >         at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68)
> > > >         at java.util.TimerThread.mainLoop(Timer.java:512)
> > > >         at java.util.TimerThread.run(Timer.java:462)
> > > > 2015-05-19 20:36:22,223 INFO org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure: Aborting all online FLUSH snapshot subprocedure task threads for 'ss_rich_pin_data_v1' due to error
> > > > org.apache.hadoop.hbase.errorhandling.TimeoutException via timer-java.util.Timer@25ff6d2a:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms
> > > >         at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85)
> > > >         at org.apache.hadoop.hbase.procedure.Procedure.waitForLatch(Procedure.java:369)
> > > >         at org.apache.hadoop.hbase.procedure.Subprocedure.waitForReachedGlobalBarrier(Subprocedure.java:296)
> > > >         at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:170)
> > > >         at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52)
> > > >         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> > > >         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> > > >         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> > > >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> > > >         at java.lang.Thread.run(Thread.java:662)
> > > > Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms
> > > >         at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68)
> > > >         at java.util.TimerThread.mainLoop(Timer.java:512)
> > > >         at java.util.TimerThread.run(Timer.java:462)
> > > > 2015-05-19 20:36:22,223 DEBUG org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager: cancelling 0 tasks for snapshot
> > > >
> > > > On Tue, May 19, 2015 at 1:30 PM, Tianying Chang <tychang@gmail.com>
> > > > wrote:
> > > >
> > > > > Matteo,
> > > > >
> > > > > We are using HDFS 2.0 + HBase 0.94.7.
> > > > >
> > > > > I saw this ArrayIndexOutOfBoundsException: 2 error also. What does
> > > > > that mean?
> > > > >
> > > > > BTW, other tables in this same cluster (though smaller in terms of
> > > > > region count) are able to create snapshots; only this table is
> > > > > failing.
> > > > >
> > > > > Thanks
> > > > > Tian-Ying
> > > > >
> > > > > On Tue, May 19, 2015 at 11:50 AM, Matteo Bertozzi
> > > > > <theo.bertozzi@gmail.com> wrote:
> > > > >
> > > > >> Can you debug the protobuf problem? I think we abort because we
> > > > >> are not able to write.
> > > > >>
> > > > >> 2015-05-19 06:00:49,745 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 50 on 60000 caught: java.lang.ArrayIndexOutOfBoundsException: 2
> > > > >>         at java.util.Arrays$ArrayList.get(Arrays.java:3381)
> > > > >>         at java.util.Collections$UnmodifiableList.get(Collections.java:1152)
> > > > >>         at org.apache.hadoop.hbase.protobuf.generated.HBaseProtos$SnapshotDescription$Type.getValueDescriptor(HBaseProtos.java:99)
> > > > >> ...
> > > > >>         at com.google.protobuf.AbstractMessage.toString(AbstractMessage.java:86)
> > > > >>         at org.apache.hadoop.hbase.snapshot.HSnapshotDescription.toString(HSnapshotDescription.java:72)
> > > > >>         at java.lang.String.valueOf(String.java:2826)
> > > > >>         at java.lang.StringBuilder.append(StringBuilder.java:115)
> > > > >>         at org.apache.hadoop.hbase.ipc.Invocation.toString(Invocation.java:152)
> > > > >>         at org.apache.hadoop.hbase.ipc.HBaseServer$Call.toString(HBaseServer.java:304)
> > > > >>
> > > > >> Matteo
> > > > >>
> > > > >>
> > > > >> On Tue, May 19, 2015 at 11:35 AM, Tianying Chang
> > > > >> <tychang@gmail.com> wrote:
> > > > >>
> > > > >> > Actually, I find it does not even print out the debug info below
> > > > >> > for this table; other tables will print out this logging. So it
> > > > >> > seems it did not invoke the FlushSnapshotSubprocedure at all.
> > > > >> >
> > > > >> >   @Override
> > > > >> >   public Void call() throws Exception {
> > > > >> >     // Taking the region read lock prevents the individual region
> > > > >> >     // from being closed while a snapshot is in progress. This is
> > > > >> >     // helpful but not sufficient for preventing races with
> > > > >> >     // snapshots that involve multiple regions and regionservers.
> > > > >> >     // It is still possible to have an interleaving such that
> > > > >> >     // globally regions are missing, so we still need the
> > > > >> >     // verification step.
> > > > >> >     LOG.debug("Starting region operation on " + region);
> > > > >> >
> > > > >> > On Tue, May 19, 2015 at 11:26 AM, Tianying Chang
> > > > >> > <tychang@gmail.com> wrote:
> > > > >> >
> > > > >> > > Hi, Esteban,
> > > > >> > >
> > > > >> > > There is no region splitting in this cluster, since we set the
> > > > >> > > region size upper bound really high to prevent splits.
> > > > >> > >
> > > > >> > > I think it happens for all the regions of this table.
> > > > >> > >
> > > > >> > > I repeatedly ran "hdfs dfs -lsr
> > > > >> > > /hbase/.hbase-snapshot/ss_rich_pin_data_v1" while taking the
> > > > >> > > snapshot; no region was able to write into this directory. I
> > > > >> > > also turned on DEBUG logging on the RS; all RSs just report
> > > > >> > > failure with a timeout, with no specific reason.
> > > > >> > >
> > > > >> > > Thanks
> > > > >> > > Tian-Ying
> > > > >> > >
> > > > >> > > On Tue, May 19, 2015 at 11:06 AM, Esteban Gutierrez
> > > > >> > > <esteban@cloudera.com> wrote:
> > > > >> > >
> > > > >> > >> Hi Tianying,
> > > > >> > >>
> > > > >> > >> Is this happening consistently in this region, or is it
> > > > >> > >> happening randomly across other regions too? One possibility
> > > > >> > >> is that there was a split going on at the time you started to
> > > > >> > >> take the snapshot and it failed. If you look into
> > > > >> > >> /hbase/rich_pin_data_v1, can you find a directory named
> > > > >> > >> dff681880bb2b23d0351d6656a1dbbb9 in there?
> > > > >> > >>
> > > > >> > >> cheers,
> > > > >> > >> esteban.
> > > > >> > >>
> > > > >> > >>
> > > > >> > >> --
> > > > >> > >> Cloudera, Inc.
> > > > >> > >>
> > > > >> > >>
> > > > >> > >> On Mon, May 18, 2015 at 11:12 PM, Tianying Chang
> > > > >> > >> <tychang@gmail.com> wrote:
> > > > >> > >>
> > > > >> > >> > Hi,
> > > > >> > >> >
> > > > >> > >> > We have a cluster that used to be able to take snapshots.
> > > > >> > >> > But recently, one table failed due to the error below.
> > > > >> > >> > Other tables on the same cluster are fine.
> > > > >> > >> >
> > > > >> > >> > Any idea what could go wrong? Is the table not healthy? But
> > > > >> > >> > when I run hbase hbck, it reports the cluster as healthy.
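> > > > >> > >> >
> > > > >> > >> > The hbck run was just the stock tool, roughly as below (the
> > > > >> > >> > -details flag is from memory and may differ by version):
> > > > >> > >> >
> > > > >> > >> >   hbase hbck -details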
> > > > >> > >> >
> > > > >> > >> > BTW, we are running 94.7, so we need to take a snapshot of
> > > > >> > >> > the data to export to a new cluster running 94.26 as part
> > > > >> > >> > of an upgrade (and eventually upgrade to 1.x); the export
> > > > >> > >> > would presumably go through the stock ExportSnapshot tool,
> > > > >> > >> > sketched below.
> > > > >> > >> >
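> > > > >> > >> > A sketch of that export invocation (the destination URL and
> > > > >> > >> > mapper count are placeholders, and I believe ExportSnapshot
> > > > >> > >> > only ships with 0.94.6 and later):
> > > > >> > >> >
> > > > >> > >> >   hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
> > > > >> > >> >     -snapshot ss_rich_pin_data_v1 \
> > > > >> > >> >     -copy-to hdfs://new-cluster-nn/hbase \
> > > > >> > >> >     -mappers 16
> > > > >> > >> >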
> > > > >> > >> > Thanks
> > > > >> > >> > Tian-Ying
> > > > >> > >> >
> > > > >> > >> >
> > > > >> > >> > 2015-05-19 06:00:45,505 ERROR org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler: Failed taking snapshot { ss=ss_rich_pin_data_v1 table=rich_pin_data_v1 type=SKIPFLUSH } due to exception:No region directory found for region:{NAME => 'rich_pin_data_v1,,1389319134976.dff681880bb2b23d0351d6656a1dbbb9.', STARTKEY => '', ENDKEY => '001ff3a165ff571471603035ca7b4be9', ENCODED => dff681880bb2b23d0351d6656a1dbbb9,}
> > > > >> > >> > org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: No region directory found for region:{NAME => 'rich_pin_data_v1,,1389319134976.dff681880bb2b23d0351d6656a1dbbb9.', STARTKEY => '', ENDKEY => '001ff3a165ff571471603035ca7b4be9', ENCODED => dff681880bb2b23d0351d6656a1dbbb9,}
> > > > >> > >> >         at org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifyRegion(MasterSnapshotVerifier.java:167)
> > > > >> > >> >         at org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifyRegions(MasterSnapshotVerifier.java:152)
> > > > >> > >> >         at org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifySnapshot(MasterSnapshotVerifier.java:115)
> > > > >> > >> >         at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.process(TakeSnapshotHandler.java:156)
> > > > >> > >> >         at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
> > > > >> > >> >         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> > > > >> > >> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> > > > >> > >> >         at java.lang.Thread.run(Thread.java:662)
> > > > >> > >> > 2015-05-19 06:00:45,505 INFO org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler: Stop taking snapshot={ ss=ss_rich_pin_data_v1 table=rich_pin_data_v1 type=SKIPFLUSH } because: Failed to take snapshot '{ ss=ss_rich_pin_data_v1 table=rich_pin_data_v1 type=SKIPFLUSH }' due to exception
> > > > >> > >> > 2015-05-19 06:00:49,745 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 50 on 60000 caught: java.lang.ArrayIndexOutOfBoundsException: 2
> > > > >> > >> >         at java.util.Arrays$ArrayList.get(Arrays.java:3381)
> > > > >> > >> >         at java.util.Collections$UnmodifiableList.get(Collections.java:1152)
> > > > >> > >> >         at org.apache.hadoop.hbase.protobuf.generated.HBaseProtos$SnapshotDescription$Type.getValueDescriptor(HBaseProtos.java:99)
> > > > >> > >> >         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > >> > >> >         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > > > >> > >> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > > >> > >> >         at java.lang.reflect.Method.invoke(Method.java:597)
> > > > >> > >> >         at com.google.protobuf.GeneratedMessage.invokeOrDie(GeneratedMessage.java:1369)
> > > > >> > >> >         at com.google.protobuf.GeneratedMessage.access$1400(GeneratedMessage.java:57)
> > > > >> > >> >         at com.google.protobuf.GeneratedMessage$FieldAccessorTable$SingularEnumFieldAccessor.get(GeneratedMessage.java:1670)
> > > > >> > >> >         at com.google.protobuf.GeneratedMessage.getField(GeneratedMessage.java:162)
> > > > >> > >> >         at com.google.protobuf.GeneratedMessage.getAllFieldsMutable(GeneratedMessage.java:113)
> > > > >> > >> >         at com.google.protobuf.GeneratedMessage.getAllFields(GeneratedMessage.java:152)
> > > > >> > >> >         at com.google.protobuf.TextFormat$Printer.print(TextFormat.java:228)
> > > > >> > >> >         at com.google.protobuf.TextFormat$Printer.access$200(TextFormat.java:217)
> > > > >> > >> >         at com.google.protobuf.TextFormat.print(TextFormat.java:68)
> > > > >> > >> >         at com.google.protobuf.TextFormat.printToString(TextFormat.java:115)
> > > > >> > >> >         at com.google.protobuf.AbstractMessage.toString(AbstractMessage.java:86)
> > > > >> > >> >         at org.apache.hadoop.hbase.snapshot.HSnapshotDescription.toString(HSnapshotDescription.java:72)
> > > > >> > >> >         at java.lang.String.valueOf(String.java:2826)
> > > > >> > >> >         at java.lang.StringBuilder.append(StringBuilder.java:115)
> > > > >> > >> >         at org.apache.hadoop.hbase.ipc.Invocation.toString(Invocation.java:152)
> > > > >> > >> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Call.toString(HBaseServer.java:304)
> > > > >> > >> >         at java.lang.String.valueOf(String.java:2826)
> > > > >> > >> >         at java.lang.StringBuilder.append(StringBuilder.java:115)
> > > > >> > >> >
> > > > >> > >>
> > > > >> > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>
