hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Yet Another LeaseException :-(
Date Tue, 22 May 2012 20:48:26 GMT
Makes sense.
Would you mind opening a JIRA for adding a debug log?

On Tue, May 22, 2012 at 1:42 PM, Igal Shilman <igals@wix.com> wrote:

> Hi Ted,
>
> Thank you for your reply, I've followed your advice, and added a log
> message in the catch block.
> I've been trying to reproduce the problem (running sparse scans, long
> jobs, etc.), but it hasn't happened yet.
>
> I think that adding a log message there (even at debug level) might be
> useful in other scenarios as well, since some paths might silently drop
> previous exceptions too (some paths in translateException result in an
> exception being thrown).
>
> Thanks,
> Igal.
>
> On Mon, May 21, 2012 at 7:46 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>
> > Thanks for the analysis.
> >
> > It shouldn't be difficult to verify your hypothesis.
> > In the following code:
> >        } catch (Throwable t) {
> >          t = translateException(t);
> >          exceptions.add(t);
> > You can add a log to show the type of t along with information about
> > callable.
> >
> > When LeaseException happens again, it would be easier to correlate logs.
> >
> > Cheers
> >
> > On Mon, May 21, 2012 at 7:34 AM, Igal Shilman <igals@wix.com> wrote:
> >
> > > Hi,
> > >
> > > We've noticed in our production cluster (0.90.4-cdh3u3) that from time to
> > > time some of our map tasks fail due to a LeaseException thrown while
> > > scanning.
> > >
> > > We have "hbase.regionserver.lease.period" and "hbase.rpc.timeout" both set
> > > to 5 minutes.
> > >
> > > What's strange about this is the sequence of events that causes the maps to
> > > fail:
> > > (Relevant log parts are here: http://pastebin.com/d1yckmz6)
> > >
> > > (a) A client calls next(69901879722105864, 100).
> > > (b) HRegionServer:next tries to call removeLease(69901879722105864) and a
> > > LeaseException is thrown (lease 69901879722105864 does not exist).
> > > (c) A few milliseconds later the mapper logs the same error and terminates
> > > immediately.
> > > (d) A minute later we see that RegionServer$Responder.doRespond fails
> > > because the stream is closed (our client died a minute earlier).
> > > (e) Five minutes later (= our lease period) the RegionServer's log shows:
> > > Scanner 69901879722105864 lease expired.
> > >
> > > Now that seems pretty odd, especially since (b) happened 5 minutes before
> > > (e).
> > >
> > > This might be possible, IMHO, in the following scenario:
> > > 1. A ScannerCallable wishing to call next(69901879722105864, 100) is
> > > passed to getRegionServerWithRetries.
> > >
> > > 2. The RS accepts it, enters next(69901879722105864, 100), and removes the
> > > lease associated with "69901879722105864".
> > >
> > > 3. Meanwhile, getRegionServerWithRetries catches an exception that is not of
> > > type DoNotRetryIOException (perhaps a socket timeout?) while waiting for this
> > > callable to complete.
> > >
> > > getRegionServerWithRetries just silently adds this to a list of exceptions.
> > >
> > > 4. Then a retry causes (b), followed by a rethrow of a LeaseException (masking
> > > any previous exceptions that were accumulated in (3)).
> > >
> > > Does this scenario seem possible to anyone?
> > >
> > > Thanks,
> > > Igal.
> > >
> >
>
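
The hypothesis in steps 1-4, together with the suggested log line in the catch block, can be sketched as a small standalone Java program. This is not the HBase client code: RetryMaskingSketch, FakeScannerCall, DoNotRetryException, callWithRetries and LOG are hypothetical stand-ins, and the sketch assumes that the lease error is treated as non-retryable and rethrown immediately while the earlier timeout is only accumulated.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

public class RetryMaskingSketch {

    /** Hypothetical stand-in for a non-retryable error such as the lease failure. */
    static class DoNotRetryException extends IOException {
        DoNotRetryException(String msg) { super(msg); }
    }

    /**
     * Hypothetical scanner call. The first attempt consumes the lease on the
     * "server" and then the reply is lost (timeout); any later attempt finds
     * no lease.
     */
    static class FakeScannerCall implements Callable<String> {
        private boolean leaseHeld = true;

        @Override
        public String call() throws IOException {
            if (leaseHeld) {
                leaseHeld = false; // lease removed while serving the first next()
                throw new SocketTimeoutException("timed out waiting for next()");
            }
            throw new DoNotRetryException("lease does not exist");
        }
    }

    /** In-memory log so the order of failures can be inspected afterwards. */
    static final List<String> LOG = new ArrayList<>();

    /** Simplified retry loop (assumption: not the real getRegionServerWithRetries). */
    static <T> T callWithRetries(Callable<T> callable, int maxTries) throws IOException {
        List<Throwable> exceptions = new ArrayList<>();
        for (int tries = 0; tries < maxTries; tries++) {
            try {
                return callable.call();
            } catch (Throwable t) {
                // The proposed debug line: record every failure, not only the last one.
                LOG.add("try " + tries + " failed: "
                        + t.getClass().getSimpleName() + ": " + t.getMessage());
                if (t instanceof DoNotRetryException) {
                    // Rethrown as-is; whatever is in 'exceptions' never reaches the caller.
                    throw (DoNotRetryException) t;
                }
                exceptions.add(t); // retryable failure, silently accumulated
            }
        }
        throw new IOException("retries exhausted: " + exceptions);
    }

    public static void main(String[] args) {
        try {
            callWithRetries(new FakeScannerCall(), 3);
        } catch (IOException e) {
            System.out.println("surfaced to caller: " + e.getMessage());
        }
        for (String line : LOG) {
            System.out.println(line);
        }
    }
}
```

In this sketch the caller only ever sees the lease error, while the log additionally shows the timeout that preceded it, which is the correlation the extra log message is meant to make possible.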
