From: Brian Jeltema
To: user@hbase.apache.org
Subject: Re: snapshot timeouts
Date: Wed, 8 Oct 2014 16:23:53 -0400

Thanks for the quick responses. I’ll get back on this later; I discovered that HBase didn’t restart properly after changing the timeouts, so the second ERROR may be a side-effect of that. I also just discovered that the table in question was not pre-split properly, and the region distribution is screwed up. So I’ll clean up the mess and try again tomorrow.

Regrets for the possible false alarm

Brian

On Oct 8, 2014, at 3:25 PM, Brian Jeltema wrote:

> Sorry, I usually include that info. HBase version is 0.98. hbase.rpc.timeout is the default.
>
> When the ‘ERROR: Call id….’ occurred, there was no stack trace. That was the entire error output.
>
> Before I increased the snapshot timeout parameters, the timeout I was seeing looked like:
>
> ERROR: org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: Snapshot { ss=Host-bdj table=Host type=FLUSH } had an error.
> Procedure Host-bdj { waiting=[] done=[host-22.hdfs.foo.net,60020,1410543068459,
>   host-24.hdfs.foo.net,60020,1412603246174,
>   host-17.hdfs.foo.net,60020,1410543059186,
>   host-19.hdfs.foo.net,60020,1412419924491,
>   host-20.hdfs.foo.net,60020,1412419942143,
>   host-16.hdfs.foo.net,60020,1403178964733,
>   host-15.hdfs.foo.net,60020,1403178962029,
>   host-21.hdfs.foo.net,60020,1403178959748,
>   host-23.hdfs.foo.net,60020,1410543079248,
>   host-18.hdfs.foo.net,60020,1410543061865] }
>     at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:366)
>     at org.apache.hadoop.hbase.master.HMaster.isSnapshotDone(HMaster.java:2993)
>     at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:38245)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>     at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:73)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException via timer-java.util.Timer@3097c4e1:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1412792382137, End:1412792442137, diff:60000, max:60000 ms
>     at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:83)
>     at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:318)
>     at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:356)
>     ... 10 more
> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1412792382137, End:1412792442137, diff:60000, max:60000 ms
>     at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:67)
>     at java.util.TimerThread.mainLoop(Timer.java:555)
>     at java.util.TimerThread.run(Timer.java:505)
>
> On Oct 8, 2014, at 3:18 PM, Ted Yu wrote:
>
>> Can you give a bit more information :
>>
>> the release of hbase you're using
>> value for hbase.rpc.timeout (looks like you leave it @ default)
>> more of the error (please include stack trace if possible)
>>
>> Cheers
>>
>> On Wed, Oct 8, 2014 at 12:09 PM, Brian Jeltema <brian.jeltema@foo.net> wrote:
>>
>>> I’m trying to snapshot a moderately large table (3 billion rows, but not a
>>> huge amount of data per row).
>>> Those snapshots have been timing out, so I set the following parameters to
>>> relatively large values:
>>>
>>> hbase.snapshot.master.timeoutMillis
>>> hbase.snapshot.region.timeout
>>> hbase.snapshot.master.timeout.millis
>>>
>>> A snapshot attempt then resulted in the terse result:
>>>
>>> ERROR: Call id=13, waitTime=60060, rpcTimeout=60000
>>>
>>> A brief review of some of the hbase log files didn’t reveal anything (but
>>> there are many).
>>> How should I pursue getting these snapshots to work?
>>>
>>> Brian
>
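As a rough cross-check of the two errors quoted in this thread, here is a minimal Python sketch (not part of the original messages) that pulls the numeric fields out of the TimeoutException text and out of the "Call id=13" line. The helper name parse_fields is hypothetical, and the field layout is assumed from these two messages only, not from any HBase specification.

```python
import re

# Message text copied from the errors quoted in this thread.
TIMEOUT_MSG = ("Timeout elapsed! Source:Timeout caused Foreign Exception "
               "Start:1412792382137, End:1412792442137, diff:60000, max:60000 ms")
RPC_MSG = "ERROR: Call id=13, waitTime=60060, rpcTimeout=60000"


def parse_fields(message, separator):
    """Hypothetical helper: return {name: int} for each name<separator>digits pair."""
    pattern = r"(\w+)" + re.escape(separator) + r"(\d+)"
    return {name: int(value) for name, value in re.findall(pattern, message)}


snapshot = parse_fields(TIMEOUT_MSG, ":")   # Start, End, diff, max
rpc = parse_fields(RPC_MSG, "=")            # id, waitTime, rpcTimeout

# Elapsed time of the snapshot timer, recomputed from the two timestamps.
elapsed_ms = snapshot["End"] - snapshot["Start"]

print(f"snapshot timer: elapsed {elapsed_ms} ms, limit {snapshot['max']} ms")
print(f"rpc call: waited {rpc['waitTime']} ms, limit {rpc['rpcTimeout']} ms")
```

Both limits are 60000 ms, but they appear to be different timers: the first is the snapshot timeout seen before the snapshot parameters were raised, and the second is the RPC timeout, which Brian reports was left at its default.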