hbase-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Lease does not exist exceptions
Date Thu, 20 Sep 2012 15:15:37 GMT
Hi Igal,

I seem to have missed that mail in my search. Thanks for pointing it
out - you are right on there. I commented on the JIRA, it is a nice
improvement.

On Thu, Sep 20, 2012 at 7:52 PM, Igal Shilman <igals@wix.com> wrote:
> Hi,
> Do you mind taking a look at HBASE-6071 ?
>
> It was submitted as a result of this mail (back in May):
> http://mail-archives.apache.org/mod_mbox/hbase-user/201205.mbox/%3CCAFebPXBq9V9BVdzRTNr-MB3a1Lz78SZj6gvP6On0b%2Bajt9StAg%40mail.gmail.com%3E
>
> I've recently submitted logs that (I think) confirm this theory.
>
> Thanks,
> Igal.
>
> On Thu, Sep 20, 2012 at 4:55 PM, Harsh J <harsh@cloudera.com> wrote:
>
>> Hi Daniel,
>>
>> That sounds fine to do (the easier solution; my brain's gotten complex today,
>> ha).
>>
>> We should classify the two types of error in the docs for users the
>> way you have here, to indicate what the issue is in each of the error
>> cases - UnknownScannerException and LeaseException. Mind filing a
>> JIRA? :)
>>
>> On Thu, Sep 20, 2012 at 7:21 PM, Daniel Iancu <daniel.iancu@1and1.ro>
>> wrote:
>> > Thaaank you! I was waiting for this email for months. I've read all the
>> > posts regarding lease timeouts and see that people usually hit them for
>> > two reasons. One, the normal case where the client app does not process
>> > the rows fast enough, so they get UnknownScannerException; and two, the
>> > issue below, where they get LeaseException instead.
>> >
>> > How about wrapping the following in a try/catch:
>> >
>> >       // Remove lease while its being processed in server; protects against case
>> >       // where processing of request takes > lease expiration time.
>> >       lease = this.leases.removeLease(scannerName);
>> >
>> > and re-throwing an IllegalStateException or logging a warning message, because
>> > a client with an active scanner but no lease does not seem to be in the right
>> > state?
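>> >
>> > Something like this rough sketch, maybe (untested, just illustrating the
>> > idea against the next() code above; treat the names as approximate):
>> >
>> >       Lease lease = null;
>> >       try {
>> >         // Remove lease while its being processed in server; protects against case
>> >         // where processing of request takes > lease expiration time.
>> >         lease = this.leases.removeLease(scannerName);
>> >       } catch (LeaseException le) {
>> >         // An open scanner with no lease looks like a duplicate/retried next(),
>> >         // so log and say that explicitly instead of the bare LeaseException.
>> >         LOG.warn("next() called on scanner " + scannerName
>> >             + " but its lease is already gone", le);
>> >         throw new IllegalStateException("Scanner " + scannerName
>> >             + " has no lease; possibly a retried next() call", le);
>> >       }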
>> >
>> > Just an idea but you know  better.
>> > Daniel
>> >
>> > On 09/20/2012 03:42 PM, Harsh J wrote:
>> >
>> > Hi,
>> >
>> > I hit this today and got down to investigating it, and one of my
>> > colleagues discovered this thread. Since I have some more clues, I
>> > thought I'd bump this thread for good.
>> >
>> > Lucian almost got the issue here. The thing we missed thinking about
>> > is the client retry. The HBaseRPC client seems to silently retry on
>> > timeouts. So if you take Lucian's theory below and add a client retry
>> > that calls next(ID, Rows) yet again, you can construct this issue:
>> >
>> > - Client calls next(ID, Rows) first time.
>> > - RS receives the handler-sent request, removes lease (to not expire
>> > it during next() call) and begins work.
>> > - RS#next hangs during work (for whatever reason we can assume - large
>> > values or locks or whatever)
>> > - Client times out after a minute, retries (due to default nature).
>> > Retry seems to be silent though?
>> > - New next(ID, Rows) call is invoked. Scanner still exists so no
>> > UnknownScanner is thrown. But when next() tries to remove lease, we
>> > get thrown LeaseException (and the client gets this immediately and
>> > dies) as the other parallel handler has the lease object already
>> > removed and held in its stuck state.
>> > - A few secs/mins later, the original next() unfreezes, adds the lease
>> > back to the queue, tries to write back its response, and runs into
>> > ClosedChannelException as the client has already thrown its original
>> > socket away. End of client.
>> > - One lease period later, the lease expires and is formally removed
>> > without any hitches.
>> >
>> > Ideally, to prevent this, hbase.rpc.timeout must be > the lease period,
>> > as was pointed out: in that case we'd have waited X units more for the
>> > original next() to unblock and finish on its own, and not have retried.
>> > That is how this is usually avoided, unintentionally, but it can still
>> > happen if next() takes very long.
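>> >
>> > In other words, ensure something like the following holds in hbase-site.xml
>> > (on the region servers and for the scanning clients; the numbers here are
>> > purely illustrative):
>> >
>> >   <property>
>> >     <name>hbase.regionserver.lease.period</name>
>> >     <value>240000</value> <!-- scanner lease of 4 minutes -->
>> >   </property>
>> >   <property>
>> >     <name>hbase.rpc.timeout</name>
>> >     <value>300000</value> <!-- RPC timeout of 5 minutes, kept above the lease period -->
>> >   </property>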
>> >
>> > I haven't seen a LeaseException in any other case so far, so maybe we
>> > can improve that exception's message to indicate what's going on in
>> > simpler terms, so that clients can reconfigure to fix themselves?
>> >
>> > Also, we could add in some measures to prevent next()-duping, as that
>> > is never going to work given the lease-required system. Perhaps when
>> > next() removes the lease, we could store it somewhere global (an
>> > ActiveLeases set or some such) and reject a duplicate next() if its
>> > requested lease is already in ActiveLeases? That just ends up giving a
>> > better message, not a solution.
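>> >
>> > Very rough sketch of that guard, purely to illustrate (none of these names
>> > exist in the code today):
>> >
>> >   // scanner names whose lease is currently held by an in-flight next() call
>> >   private final Set<String> activeLeases =
>> >       Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());
>> >
>> >   // in next(), before removing the lease:
>> >   if (!activeLeases.add(scannerName)) {
>> >     throw new IOException("Scanner " + scannerName +
>> >         " is already being served by another next() call (client retry after timeout?)");
>> >   }
>> >   // ... do the scan work, and in a finally block: activeLeases.remove(scannerName);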
>> >
>> > Hope this helps others who've run into the same issue.
>> >
>> > On Mon, Oct 24, 2011 at 10:52 PM, Jean-Daniel Cryans
>> > <jdcryans@apache.org> wrote:
>> >
>> > So you should see the SocketTimeoutException in your *client* logs (in
>> > your case, the mappers), not LeaseException. At this point, yes, you're
>> > going to time out, but if you spend so much time cycling on the server
>> > side then you shouldn't set a high caching configuration on your
>> > scanner, as IO isn't your bottleneck.
>> >
>> > J-D
>> >
>> > On Mon, Oct 24, 2011 at 10:15 AM, Lucian Iordache
>> > <lucian.george.iordache@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > The servers have been restarted (I have had this configuration for more
>> > than a month, so this is not the problem).
>> > About the stack traces, they show exactly the same thing: a lot of
>> > ClosedChannelExceptions and LeaseExceptions.
>> >
>> > But I found something that could be the problem: hbase.rpc.timeout. This
>> > defaults to 60 seconds, and I did not modify it in hbase-site.xml. So it
>> > could happen the following way:
>> > - the mapper makes a scanner.next call to the region server
>> > - the region server needs more than 60 seconds to execute it (I use
>> > multiple filters, and it could take a lot of time)
>> > - the scan client gets the timeout and cuts the connection
>> > - the region server tries to send the results to the client ==>
>> > ClosedChannelException
>> >
>> > I will get a deeper look into it tomorrow. If you have other suggestions,
>> > please let me know!
>> >
>> > Thanks,
>> > Lucian
>> >
>> > On Mon, Oct 24, 2011 at 8:00 PM, Jean-Daniel Cryans
>> > <jdcryans@apache.org> wrote:
>> >
>> > Did you restart the region servers after changing the config?
>> >
>> > Are you sure it's the same exception/stack trace?
>> >
>> > J-D
>> >
>> > On Mon, Oct 24, 2011 at 8:04 AM, Lucian Iordache
>> > <lucian.george.iordache@gmail.com> wrote:
>> >
>> > Hi all,
>> >
>> > I have exactly the same problem that Eran had.
>> > But there is something I don't understand: in my case, I have set the
>> >
>> > lease
>> >
>> > time to 240000 (4 minutes). But most of the map tasks that are failing
>> >
>> > run
>> >
>> > about 2 minutes. How is it possible to get a LeaseException if the task
>> >
>> > runs
>> >
>> > less than the configured time for a lease?
>> >
>> > Regards,
>> > Lucian Iordache
>> >
>> > On Fri, Oct 21, 2011 at 12:34 AM, Eran Kutner <eran@gigya.com> wrote:
>> >
>> > Perfect! Thanks.
>> >
>> > -eran
>> >
>> >
>> >
>> > On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans <jdcryans@apache.org>
>> > wrote:
>> >
>> > hbase.regionserver.lease.period
>> >
>> > Set it bigger than 60000.
>> >
>> > J-D
>> >
>> > On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <eran@gigya.com> wrote:
>> >
>> > Thanks J-D!
>> > Since my main table is expected to continue growing, I guess at some point
>> > even setting the cache size to 1 will not be enough. Is there a way to
>> > configure the lease timeout?
>> >
>> > -eran
>> >
>> >
>> >
>> > On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans <jdcryans@apache.org>
>> > wrote:
>> >
>> > On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <eran@gigya.com> wrote:
>> >
>> > Hi J-D,
>> > Thanks for the detailed explanation.
>> > So if I understand correctly the lease we're talking about is a scanner
>> > lease and the timeout is between two scanner calls, correct? I think that
>> > makes sense because I now realize that jobs that fail (some jobs continued
>> > to fail even after reducing the number of map tasks as Stack suggested)
>> > use filters to fetch relatively few rows out of a very large table, so
>> > they could be spending a lot of time on the region server scanning rows
>> > until they reached my setCaching value, which was 1000. Setting the
>> > caching value to 1 seems to allow these jobs to complete.
>> > I think it has to be the above, since my rows are small, with just a few
>> > columns, and processing them is very quick.
>> >
>> > Excellent!
>> >
>> >
>> > However, there are still a couple of things I don't understand:
>> > 1. What is the difference between setCaching and setBatch?
>> >
>> > * Set the maximum number of values to return for each call to next()
>> >
>> > VS
>> >
>> > * Set the number of rows for caching that will be passed to scanners.
>> >
>> > The former is useful if you have rows with millions of columns and you
>> > could setBatch to get only 1000 of them at a time. You could call that
>> > intra-row scanning.
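>> >
>> > For example, on the client side (numbers just for illustration):
>> >
>> >   Scan scan = new Scan();
>> >   scan.setCaching(100); // rows shipped to the client per next() RPC round-trip
>> >   scan.setBatch(1000);  // max columns returned per next() within a single wide row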
>> >
>> >
>> > 2. Examining the region server logs more closely than I did yesterday, I
>> > see a lot of ClosedChannelExceptions in addition to the expired leases
>> > (but no UnknownScannerException), is that expected? You can see an
>> > excerpt of the log from one of the region servers here:
>> >
>> > http://pastebin.com/NLcZTzsY
>> >
>> > It means that when the server got to process that client request and
>> > started reading from the socket, the client was already gone. Killing a
>> > client does that (or killing a MR that scans), so does
>> > SocketTimeoutException. This should probably go in the book. We should
>> > also print something nicer :)
>> >
>> > J-D
>> >
>> >
>> >
>> > --
>> > Harsh J
>> >
>> >
>>
>>
>>
>> --
>> Harsh J
>>



-- 
Harsh J
