Subject: Re: snapshot timeouts
From: Brian Jeltema <brian.jeltema@digitalenvoy.net>
Date: Wed, 8 Oct 2014 16:23:53 -0400
To: user@hbase.apache.org

Thanks for the quick responses. I'll get back on this later; I discovered that
HBase didn't restart properly after changing the timeouts, so the second ERROR
may be a side-effect of that.

I also just discovered that the table in question was not pre-split properly,
and the region distribution is screwed up. So I'll clean up the mess and try
again tomorrow.

Sorry for the possible false alarm.

Brian

On Oct 8, 2014, at 3:25 PM, Brian Jeltema <brian.jeltema@digitalenvoy.net> wrote:

> Sorry, I usually include that info. HBase version is 0.98. hbase.rpc.timeout is the default.
>
> When the 'ERROR: Call id...' occurred, there was no stack trace. That was the entire error output.
>
> Before I increased the snapshot timeout parameters, the timeout I was seeing looked like:
>
> ERROR: org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: Snapshot { ss=Host-bdj table=Host type=FLUSH } had an error.  Procedure Host-bdj { waiting=[] done=[host-22.hdfs.foo.net,60020,1410543068459, host-24.hdfs.foo.net,60020,1412603246174, host-17.hdfs.foo.net,60020,1410543059186, host-19.hdfs.foo.net,60020,1412419924491, host-20.hdfs.foo.net,60020,1412419942143, host-16.hdfs.foo.net,60020,1403178964733, host-15.hdfs.foo.net,60020,1403178962029, host-21.hdfs.foo.net,60020,1403178959748, host-23.hdfs.foo.net,60020,1410543079248, host-18.hdfs.foo.net,60020,1410543061865] }
> 	at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:366)
> 	at org.apache.hadoop.hbase.master.HMaster.isSnapshotDone(HMaster.java:2993)
> 	at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:38245)
> 	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
> 	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
> 	at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:73)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException via timer-java.util.Timer@3097c4e1:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1412792382137, End:1412792442137, diff:60000, max:60000 ms
> 	at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:83)
> 	at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:318)
> 	at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:356)
> 	... 10 more
> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1412792382137, End:1412792442137, diff:60000, max:60000 ms
> 	at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:67)
> 	at java.util.TimerThread.mainLoop(Timer.java:555)
> 	at java.util.TimerThread.run(Timer.java:505)
>
> On Oct 8, 2014, at 3:18 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>
>> Can you give a bit more information:
>>
>> the release of hbase you're using
>> value for hbase.rpc.timeout (looks like you left it at the default)
>> more of the error (please include stack trace if possible)
>>
>> Cheers
>>
>> On Wed, Oct 8, 2014 at 12:09 PM, Brian Jeltema <brian.jeltema@foo.net> wrote:
>>
>>> I'm trying to snapshot a moderately large table (3 billion rows, but not a
>>> huge amount of data per row).
>>> Those snapshots have been timing out, so I set the following parameters to
>>> relatively large values:
>>>
>>>    hbase.snapshot.master.timeoutMillis
>>>    hbase.snapshot.region.timeout
>>>    hbase.snapshot.master.timeout.millis
>>>
>>> A snapshot attempt then resulted in the terse result:
>>>
>>>    ERROR: Call id=13, waitTime=60060, rpcTimeout=60000
>>>
>>> A brief review of some of the hbase log files didn't reveal anything (but
>>> there are many).
>>> How should I pursue getting these snapshots to work?
>>>
>>> Brian
>
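
Both errors quoted in this thread report a 60000 ms limit: the terse "Call id" line (rpcTimeout) and the "Timeout elapsed!" line in the master stack trace (max). The following is a minimal Python sketch for pulling those numbers out of log lines so the two limits can be compared. It assumes the message formats are exactly as quoted here; the function names are hypothetical helpers for illustration and are not part of HBase.

```python
import re

# Error lines copied from the messages quoted in this thread.
RPC_ERROR = "ERROR: Call id=13, waitTime=60060, rpcTimeout=60000"
SNAPSHOT_ERROR = (
    "Timeout elapsed! Source:Timeout caused Foreign Exception "
    "Start:1412792382137, End:1412792442137, diff:60000, max:60000 ms"
)


def parse_rpc_timeout(line):
    """Return (wait_ms, limit_ms) from a 'Call id=...' line, or None."""
    m = re.search(r"waitTime=(\d+), rpcTimeout=(\d+)", line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


def parse_snapshot_timeout(line):
    """Return the start/end/diff/max fields of a 'Timeout elapsed!' line, or None."""
    m = re.search(r"Start:(\d+), End:(\d+), diff:(\d+), max:(\d+) ms", line)
    if m is None:
        return None
    start, end, diff, limit = (int(g) for g in m.groups())
    return {"start": start, "end": end, "diff": diff, "max": limit}


if __name__ == "__main__":
    wait_ms, limit_ms = parse_rpc_timeout(RPC_ERROR)
    print("RPC call waited", wait_ms, "ms against a limit of", limit_ms, "ms")

    snap = parse_snapshot_timeout(SNAPSHOT_ERROR)
    print("Snapshot timer ran", snap["end"] - snap["start"],
          "ms against a limit of", snap["max"], "ms")
```

If a later attempt still shows 60000 in either line after the configuration change, that would suggest the new value was not picked up, which is consistent with the restart problem described at the top of this message.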