hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Yet Another LeaseException :-(
Date Mon, 21 May 2012 16:46:06 GMT
Thanks for the analysis.

It shouldn't be difficult to verify your hypothesis.
In the following code:
        } catch (Throwable t) {
          t = translateException(t);
          exceptions.add(t);
You can add a log statement that shows the type of t along with information
about the callable.

When LeaseException happens again, it would be easier to correlate logs.
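
A minimal, self-contained sketch of that idea follows. It is not the HBase source: the class RetryLoggingSketch, the method callWithRetries and the in-memory LOG list are hypothetical stand-ins for the client retry loop and its logger, and the translateException step is left out. The only point is that each caught exception is recorded with its type and the callable that produced it before it is added to the list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical stand-in for the client-side retry loop; not the HBase code.
class RetryLoggingSketch {
    // In-memory "log" so the sketch needs no logging framework.
    static final List<String> LOG = new ArrayList<>();

    static <T> T callWithRetries(Callable<T> callable, int maxAttempts) throws Exception {
        List<Throwable> exceptions = new ArrayList<>();
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return callable.call();
            } catch (Throwable t) {
                // The suggested addition: record the exception type and the callable.
                LOG.add("attempt=" + attempt
                        + " callable=" + callable
                        + " threw " + t.getClass().getName()
                        + ": " + t.getMessage());
                exceptions.add(t);
            }
        }
        // Surface the last failure as the cause once all attempts are used up.
        Throwable last = exceptions.isEmpty() ? null : exceptions.get(exceptions.size() - 1);
        throw new Exception("failed after " + maxAttempts + " attempts", last);
    }
}
```

With a line like this in place, a timeout that precedes the LeaseException would show up in the client log next to the scanner call that triggered it.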

Cheers

On Mon, May 21, 2012 at 7:34 AM, Igal Shilman <igals@wix.com> wrote:

> Hi,
>
> We've noticed in our production cluster (0.90.4-cdh3u3) that from time to
> time some of our map tasks fail due to a LeaseException thrown while
> scanning.
>
> We have "hbase.regionserver.lease.period" and "hbase.rpc.timeout" both set
> to 5 minutes.
>
> What's strange about this is the sequence of events that causes the maps to
> fail:
> (Relevant log parts are here: http://pastebin.com/d1yckmz6)
>
> (a) a client calls next(69901879722105864, 100)
> (b) HRegionServer:next tries to call removeLease(69901879722105864) and a
> LeaseException is thrown (lease 69901879722105864 does not exist.)
> (c) A few milliseconds later the mapper logs the same error, and terminates
> immediately.
> (d) A minute later we see that the RegionServer$Responder.doRespond fails
> because the stream is closed (our client died a minute earlier)
> (e) Five minutes later (=our lease period) RegionServer's log shows:
> Scanner 69901879722105864 lease expired.
>
> Now that seems pretty odd, especially since (b) happened 5 minutes before
> (e).
>
> This might be possible, IMHO, in the following scenario:
> 1. A ScannerCallable wishing to call: next(69901879722105864, 100) is
> passed to getRegionServerWithRetries
>
> 2. RS accepts it, enters next(69901879722105864, 100), and removes the
> lease associated with "69901879722105864".
>
> 3. Meanwhile, getRegionServerWithRetries catches an exception that is not of
> type DoNotRetryIOException (perhaps socket timeout?) while waiting for this
> callable to complete.
>
> getRegionServerWithRetries just silently adds this to a list of exceptions.
>
> 4. Then a retry causes (b), and then a rethrow of a LeaseException (masking
> any previous exceptions that were accumulated in (3)).
>
> Does this scenario seem possible to anyone?
>
> Thanks,
> Igal.
>
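
The hypothesis in steps 1 to 4 of the quoted message can be reproduced in a small single-threaded simulation. Everything below is an assumption made for illustration: FakeRegionServer, DoNotRetry, MissingLease and scanWithRetries are hypothetical local stand-ins, not the HBase classes, and the client-side timeout is simulated by throwing from inside the fake server call while the lease is still removed.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LeaseMaskingSketch {
    // Hypothetical local exception types; not the HBase classes.
    static class DoNotRetry extends IOException {
        DoNotRetry(String msg) { super(msg); }
    }
    static class MissingLease extends DoNotRetry {
        MissingLease(String msg) { super(msg); }
    }

    static class FakeRegionServer {
        final Set<Long> leases = new HashSet<>();

        // Removes the lease for the duration of the call, as in step 2.
        String next(long scannerId, boolean clientTimesOut) throws IOException {
            if (!leases.remove(scannerId)) {
                throw new MissingLease("lease " + scannerId + " does not exist");
            }
            if (clientTimesOut) {
                // Step 3: the client gives up while the lease is still removed.
                throw new SocketTimeoutException("simulated client-side timeout");
            }
            leases.add(scannerId); // normal path: lease restored after the call
            return "rows";
        }
    }

    // Retryable failures go into 'swallowed'; a DoNotRetry is rethrown at once.
    static String scanWithRetries(FakeRegionServer rs, long scannerId,
                                  int maxAttempts, List<Throwable> swallowed)
            throws IOException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return rs.next(scannerId, attempt == 0);
            } catch (DoNotRetry e) {
                throw e; // step 4: the caller never sees what is in 'swallowed'
            } catch (IOException e) {
                swallowed.add(e); // silently accumulated
            }
        }
        throw new IOException("retries exhausted");
    }
}
```

In this model the caller only ever sees MissingLease, while the timeout that actually started the failure sits in the swallowed list, which is the masking effect described in step 4.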
