From: Alexandre Normand
To: user@hbase.apache.org
Date: Fri, 25 Sep 2015 10:21:41 -0700
Subject: Re: Timeouts on snapshot restore

We tried decommissioning the slow node and that still didn't help. We then
increased the timeouts to 90 minutes and we had a successful restore after
32 minutes. Better a slow restore than a failed restore.

Cheers!

On Thu, Sep 24, 2015 at 8:34 PM, Alexandre Normand
<alexandre.normand@opower.com> wrote:

> Yes, RS-1 was the only data node to be excluded.
>
> The result of 'hdfs fsck /' is healthy. Blocks are reasonably replicated
> and we don't have any missing or corrupted blocks.
>
> Thanks again.
>
> On Thu, Sep 24, 2015 at 7:55 PM, Ted Yu wrote:
>
>> bq. Excluding datanode RS-1:50010
>>
>> Was RS-1 the only data node to be excluded in that timeframe?
>> Have you run fsck to see if hdfs is healthy?
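(For anyone replaying this check: the fsck pass suggested here is easy to script. The following is a hedged sketch, not a command from the thread; it assumes an `hdfs` client on the PATH and falls back to the verdict line a healthy cluster prints, so the parsing still illustrates the check when no client is installed.)

```shell
#!/bin/sh
# Sketch: confirm HDFS health before retrying the restore.
if command -v hdfs >/dev/null 2>&1; then
  # The final line of fsck output is the overall verdict.
  fsck_summary=$(hdfs fsck / 2>/dev/null | tail -n 1)
else
  # No hdfs client here: substitute the summary a healthy cluster reports.
  fsck_summary="The filesystem under path '/' is HEALTHY"
fi

case $fsck_summary in
  *HEALTHY*) echo "hdfs reports healthy" ;;
  *)         echo "hdfs reports problems: $fsck_summary" ;;
esac
```

A matching `hdfs dfsadmin -report` pass would show whether the slow datanode actually reached the Decommissioned state.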
>>
>> Cheers
>>
>> On Thu, Sep 24, 2015 at 7:47 PM, Alexandre Normand
>> <alexandre.normand@opower.com> wrote:
>>
>>> Hi Ted,
>>> We'll be upgrading to cdh5 in the coming months but we're unfortunately
>>> stuck on 0.94.6 at the moment.
>>>
>>> The RS logs were empty around the time of the failed snapshot restore
>>> operation, but the following errors were in the master log. The node
>>> 'RS-1' is the only node indicated in the logs. These errors occurred
>>> throughout the duration of the snapshot_restore operation.
>>>
>>> Sep 24, 9:51:41.655 PM INFO org.apache.hadoop.hdfs.DFSClient
>>> Exception in createBlockOutputStream
>>> java.io.IOException: Bad connect ack with firstBadLink as [RS-1]:50010
>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1117)
>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1040)
>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:488)
>>> Sep 24, 9:51:41.664 PM INFO org.apache.hadoop.hdfs.DFSClient
>>> Abandoning BP-1769805853-namenode1-1354129919031:blk_8350736734896383334_327644360
>>> Sep 24, 9:51:41.678 PM INFO org.apache.hadoop.hdfs.DFSClient
>>> Excluding datanode RS-1:50010
>>> Sep 24, 9:52:58.954 PM INFO org.apache.hadoop.hdfs.DFSClient
>>> Exception in createBlockOutputStream
>>> java.net.SocketTimeoutException: 75000 millis timeout while waiting for
>>> channel to be ready for read. ch :
>>> java.nio.channels.SocketChannel[connected local=/hmaster1:59726 remote=/RS-1:50010]
>>>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
>>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:156)
>>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
>>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:117)
>>>     at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>>     at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>>     at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:169)
>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1106)
>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1040)
>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:488)
>>> Sep 24, 9:52:58.956 PM INFO org.apache.hadoop.hdfs.DFSClient
>>> Abandoning BP-1769805853-namenode1-1354129919031:blk_-6817802178798905477_327644519
>>> Sep 24, 9:52:59.011 PM INFO org.apache.hadoop.hdfs.DFSClient
>>> Excluding datanode RS-1:50010
>>> Sep 24, 9:54:22.963 PM WARN org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector
>>> Timer already marked completed, ignoring!
>>> Sep 24, 9:54:22.964 PM ERROR org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler
>>> Failed taking snapshot { ss=** table=** type=DISABLED } due to
>>> exception:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout
>>> elapsed! Source:Timeout caused Foreign Exception Start:1443145459635,
>>> End:1443146059635, diff:600000, max:600000 ms
>>>
>>> Thanks!
>>>
>>> On Thu, Sep 24, 2015 at 6:34 PM, Ted Yu wrote:
>>>
>>>> 0.94.6 is really old.
>>>> There have been quite a few bug fixes / improvements to the snapshot
>>>> feature since its release.
>>>>
>>>> The error happens when the SnapshotDescription corresponding to
>>>> kiji.prod.table.site.DI...
>>>> was not found by the ProcedureCoordinator.
>>>>
>>>> bq. does the timeout necessarily mean that the restore failed or could
>>>> it be still running asynchronously
>>>>
>>>> Can you check region server logs around the time TimeoutException was
>>>> thrown to see which server was the straggler?
>>>>
>>>> Thanks
>>>>
>>>> On Thu, Sep 24, 2015 at 5:13 PM, Alexandre Normand
>>>> <alexandre.normand@gmail.com> wrote:
>>>>
>>>>> Hey,
>>>>> We're trying to restore a snapshot of a relatively big table (20TB)
>>>>> using hbase 0.94.6-cdh4.5.0 and we're getting timeouts doing so. We
>>>>> increased the timeout configurations (hbase.snapshot.master.timeoutMillis,
>>>>> hbase.snapshot.region.timeout, hbase.snapshot.master.timeout.millis)
>>>>> to 10 minutes but we're still experiencing the timeouts. Here's the
>>>>> error and stack trace (table name obfuscated just because):
>>>>>
>>>>> ERROR: org.apache.hadoop.hbase.snapshot.HBaseSnapshotException:
>>>>> org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: Snapshot {
>>>>> ss*****-1443136710408 table=******** type=FLUSH } had an error.
>>>>> kiji.prod.table.site.DI-1019-1443136710408 not found in proclist []
>>>>>     at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:360)
>>>>>     at org.apache.hadoop.hbase.master.HMaster.isSnapshotDone(HMaster.java:2075)
>>>>>     at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>     at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
>>>>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1428)
>>>>> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException via
>>>>> timer-java.util.Timer@8ad0d5c:org.apache.hadoop.hbase.errorhandling.TimeoutException:
>>>>> Timeout elapsed! Source:Timeout caused Foreign Exception
>>>>> Start:1443136713121, End:1443137313121, diff:600000, max:600000 ms
>>>>>     at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85)
>>>>>     at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:285)
>>>>>     at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:350)
>>>>>     ... 6 more
>>>>> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException:
>>>>> Timeout elapsed! Source:Timeout caused Foreign Exception
>>>>> Start:1443136713121, End:1443137313121, diff:600000, max:600000 ms
>>>>>     at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68)
>>>>>     at java.util.TimerThread.mainLoop(Timer.java:555)
>>>>>     at java.util.TimerThread.run(Timer.java:505)
>>>>>
>>>>> We could increase the timeout again, but we'd like to solicit some
>>>>> feedback before trying that. First, does the timeout necessarily mean
>>>>> that the restore failed, or could it still be running asynchronously
>>>>> and eventually complete? Second, what work is involved in a snapshot
>>>>> restore that would help us pick an appropriate timeout value for this
>>>>> operation?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> --
>>>>> Alex

-- 
Alexandre Normand
Software Maker
Opower
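The settings the thread ends up raising live in hbase-site.xml. Below is a hedged sketch using the property names exactly as quoted in the thread (0.94/CDH4-era; both `hbase.snapshot.master.timeoutMillis` and `hbase.snapshot.master.timeout.millis` appear there, so check which ones your exact version reads) and the 90-minute value (5400000 ms) that let the restore finish:

```xml
<!-- Illustrative values only: 90 minutes, as in the successful run. -->
<property>
  <name>hbase.snapshot.master.timeout.millis</name>
  <value>5400000</value>
</property>
<property>
  <name>hbase.snapshot.master.timeoutMillis</name>
  <value>5400000</value>
</property>
<property>
  <name>hbase.snapshot.region.timeout</name>
  <value>5400000</value>
</property>
```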
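The Start/End values in those TimeoutException messages are plain epoch milliseconds. A quick sanity check (a sketch, with the numbers copied from the trace above; nothing here comes from the thread itself) shows the watchdog fired exactly at the configured 10-minute ceiling:

```shell
#!/bin/sh
# Start/End from the trace are epoch milliseconds; diff and max are in ms.
start_ms=1443136713121
end_ms=1443137313121
max_ms=600000

elapsed_ms=$((end_ms - start_ms))
echo "elapsed: $elapsed_ms ms ($((elapsed_ms / 60000)) minutes)"
# prints: elapsed: 600000 ms (10 minutes)
if [ "$elapsed_ms" -ge "$max_ms" ]; then
  echo "hit the configured ceiling of $((max_ms / 60000)) minutes"
fi
```

In other words, the restore was cut off by the timer, not by any error of its own, which is consistent with the eventual 32-minute success once the ceiling was raised.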