From: Alexandre Normand
To: user@hbase.apache.org
Date: Fri, 25 Sep 2015 10:21:41 -0700
Subject: Re: Timeouts on snapshot restore

We tried decommissioning the slow node and that still didn't help. We then
increased the timeouts to 90 minutes and we had a successful restore after
32 minutes. Better a slow restore than a failed restore.

Cheers!

On Thu, Sep 24, 2015 at 8:34 PM, Alexandre Normand
<alexandre.normand@opower.com> wrote:

> Yes, RS-1 was the only data node to be excluded.
>
> The result of 'hdfs fsck /' is healthy. Blocks are reasonably replicated
> and we don't have any missing or corrupted blocks.
>
> Thanks again.
>
> On Thu, Sep 24, 2015 at 7:55 PM, Ted Yu wrote:
>
>> bq. Excluding datanode RS-1:50010
>>
>> Was RS-1 the only data node to be excluded in that timeframe?
>> Have you run fsck to see if hdfs is healthy?
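(For anyone replaying this check: the fsck pass suggested here is easy to script. The following is a hedged sketch, not a command from the thread; it assumes an `hdfs` client on the PATH and falls back to the verdict line a healthy cluster prints, so the parsing still illustrates the check when no client is installed.)

```shell
#!/bin/sh
# Sketch: confirm HDFS health before retrying the restore.
if command -v hdfs >/dev/null 2>&1; then
  # The final line of fsck output is the overall verdict.
  fsck_summary=$(hdfs fsck / 2>/dev/null | tail -n 1)
else
  # No hdfs client here: substitute the summary a healthy cluster reports.
  fsck_summary="The filesystem under path '/' is HEALTHY"
fi

case $fsck_summary in
  *HEALTHY*) echo "hdfs reports healthy" ;;
  *)         echo "hdfs reports problems: $fsck_summary" ;;
esac
```

A matching `hdfs dfsadmin -report` pass would show whether the slow datanode actually reached the Decommissioned state.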
>>
>> Cheers
>>
>> On Thu, Sep 24, 2015 at 7:47 PM, Alexandre Normand
>> <alexandre.normand@opower.com> wrote:
>>
>>> Hi Ted,
>>> We'll be upgrading to cdh5 in the coming months but we're unfortunately
>>> stuck on 0.94.6 at the moment.
>>>
>>> The RS logs were empty around the time of the failed snapshot restore
>>> operation, but the following errors were in the master log. The node
>>> 'RS-1' is the only node indicated in the logs. These errors occurred
>>> throughout the duration of the snapshot_restore operation.
>>>
>>> Sep 24, 9:51:41.655 PM INFO org.apache.hadoop.hdfs.DFSClient
>>> Exception in createBlockOutputStream
>>> java.io.IOException: Bad connect ack with firstBadLink as [RS-1]:50010
>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1117)
>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1040)
>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:488)
>>> Sep 24, 9:51:41.664 PM INFO org.apache.hadoop.hdfs.DFSClient
>>> Abandoning BP-1769805853-namenode1-1354129919031:blk_8350736734896383334_327644360
>>> Sep 24, 9:51:41.678 PM INFO org.apache.hadoop.hdfs.DFSClient
>>> Excluding datanode RS-1:50010
>>> Sep 24, 9:52:58.954 PM INFO org.apache.hadoop.hdfs.DFSClient
>>> Exception in createBlockOutputStream
>>> java.net.SocketTimeoutException: 75000 millis timeout while waiting for
>>> channel to be ready for read. ch :
>>> java.nio.channels.SocketChannel[connected local=/hmaster1:59726 remote=/RS-1:50010]
>>>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
>>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:156)
>>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
>>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:117)
>>>     at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>>     at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>>     at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:169)
>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1106)
>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1040)
>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:488)
>>> Sep 24, 9:52:58.956 PM INFO org.apache.hadoop.hdfs.DFSClient
>>> Abandoning BP-1769805853-namenode1-1354129919031:blk_-6817802178798905477_327644519
>>> Sep 24, 9:52:59.011 PM INFO org.apache.hadoop.hdfs.DFSClient
>>> Excluding datanode RS-1:50010
>>> Sep 24, 9:54:22.963 PM WARN org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector
>>> Timer already marked completed, ignoring!
>>> Sep 24, 9:54:22.964 PM ERROR org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler
>>> Failed taking snapshot { ss=** table=** type=DISABLED } due to
>>> exception:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout
>>> elapsed! Source:Timeout caused Foreign Exception Start:1443145459635,
>>> End:1443146059635, diff:600000, max:600000 ms
>>>
>>> Thanks!
>>>
>>> On Thu, Sep 24, 2015 at 6:34 PM, Ted Yu wrote:
>>>
>>>> 0.94.6 is really old.
>>>> There have been quite a few bug fixes / improvements to the snapshot
>>>> feature since its release.
>>>>
>>>> The error happens when the SnapshotDescription corresponding to
>>>> kiji.prod.table.site.DI...
>>>> was not found by the ProcedureCoordinator.
>>>>
>>>> bq. does the timeout necessarily mean that the restore failed or could
>>>> it be still running asynchronously
>>>>
>>>> Can you check region server logs around the time TimeoutException was
>>>> thrown to see which server was the straggler?
>>>>
>>>> Thanks
>>>>
>>>> On Thu, Sep 24, 2015 at 5:13 PM, Alexandre Normand
>>>> <alexandre.normand@gmail.com> wrote:
>>>>
>>>>> Hey,
>>>>> We're trying to restore a snapshot of a relatively big table (20TB)
>>>>> using hbase 0.94.6-cdh4.5.0 and we're getting timeouts doing so. We
>>>>> increased the timeout configurations (hbase.snapshot.master.timeoutMillis,
>>>>> hbase.snapshot.region.timeout, hbase.snapshot.master.timeout.millis)
>>>>> to 10 minutes but we're still experiencing the timeouts. Here's the
>>>>> error and stack trace (table name obfuscated just because):
>>>>>
>>>>> ERROR: org.apache.hadoop.hbase.snapshot.HBaseSnapshotException:
>>>>> org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: Snapshot {
>>>>> ss*****-1443136710408 table=******** type=FLUSH } had an error.
>>>>> kiji.prod.table.site.DI-1019-1443136710408 not found in proclist []
>>>>>     at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:360)
>>>>>     at org.apache.hadoop.hbase.master.HMaster.isSnapshotDone(HMaster.java:2075)
>>>>>     at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>     at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
>>>>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1428)
>>>>> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException via
>>>>> timer-java.util.Timer@8ad0d5c:org.apache.hadoop.hbase.errorhandling.TimeoutException:
>>>>> Timeout elapsed! Source:Timeout caused Foreign Exception
>>>>> Start:1443136713121, End:1443137313121, diff:600000, max:600000 ms
>>>>>     at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85)
>>>>>     at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:285)
>>>>>     at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:350)
>>>>>     ... 6 more
>>>>> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException:
>>>>> Timeout elapsed! Source:Timeout caused Foreign Exception
>>>>> Start:1443136713121, End:1443137313121, diff:600000, max:600000 ms
>>>>>     at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68)
>>>>>     at java.util.TimerThread.mainLoop(Timer.java:555)
>>>>>     at java.util.TimerThread.run(Timer.java:505)
>>>>>
>>>>> We could increase the timeout again, but we'd like to solicit some
>>>>> feedback before trying that. First, does the timeout necessarily mean
>>>>> that the restore failed, or could it still be running asynchronously
>>>>> and eventually complete? Second, what work is involved in a snapshot
>>>>> restore that would help us pick an appropriate timeout value for this
>>>>> operation?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> --
>>>>> Alex

-- 
Alexandre Normand
Software Maker
Opower
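The settings the thread ends up raising live in hbase-site.xml. Below is a hedged sketch using the property names exactly as quoted in the thread (0.94/CDH4-era; both `hbase.snapshot.master.timeoutMillis` and `hbase.snapshot.master.timeout.millis` appear there, so check which ones your exact version reads) and the 90-minute value (5400000 ms) that let the restore finish:

```xml
<!-- Illustrative values only: 90 minutes, as in the successful run. -->
<property>
  <name>hbase.snapshot.master.timeout.millis</name>
  <value>5400000</value>
</property>
<property>
  <name>hbase.snapshot.master.timeoutMillis</name>
  <value>5400000</value>
</property>
<property>
  <name>hbase.snapshot.region.timeout</name>
  <value>5400000</value>
</property>
```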
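The Start/End values in those TimeoutException messages are plain epoch milliseconds. A quick sanity check (a sketch, with the numbers copied from the trace above; nothing here comes from the thread itself) shows the watchdog fired exactly at the configured 10-minute ceiling:

```shell
#!/bin/sh
# Start/End from the trace are epoch milliseconds; diff and max are in ms.
start_ms=1443136713121
end_ms=1443137313121
max_ms=600000

elapsed_ms=$((end_ms - start_ms))
echo "elapsed: $elapsed_ms ms ($((elapsed_ms / 60000)) minutes)"
# prints: elapsed: 600000 ms (10 minutes)
if [ "$elapsed_ms" -ge "$max_ms" ]; then
  echo "hit the configured ceiling of $((max_ms / 60000)) minutes"
fi
```

In other words, the restore was cut off by the timer, not by any error of its own, which is consistent with the eventual 32-minute success once the ceiling was raised.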