hbase-user mailing list archives

From Daniel Iancu <daniel.ia...@1and1.ro>
Subject Re: Lease does not exist exceptions
Date Mon, 24 Sep 2012 09:25:54 GMT

Here it is:

https://issues.apache.org/jira/browse/HBASE-6856


On 09/21/2012 08:42 PM, Harsh J wrote:
> Daniel,
>
> Nice follow-up! We could add some notes around these to the doc as
> well. Please do post back a JIRA link once you've filed it.
>
> On Thu, Sep 20, 2012 at 8:15 PM, Daniel Iancu <daniel.iancu@1and1.ro> wrote:
>> Hi Harsh
>>
>> I forgot to mention that the LeaseException happens in the context of a slow
>> internal scanner. So maybe the problem is elsewhere (bad schema design,
>> hardware issue, etc.) and this is just a consequence (or symptom), but it is
>> still a problem. In our case the scan was very slow when we were using
>> filters (even the standard ones). When we dropped the filters and transferred
>> the entire row (I know that is not recommended) we haven't seen this
>> exception anymore, and the overall performance of the MR jobs improved, since
>> every LeaseException was crashing the task attempt.
>>
>> OK, I'll open a JIRA for this.
>>
>> Regards,
>> Daniel
>>
>>
>>
>>
>>
>> On 09/20/2012 04:55 PM, Harsh J wrote:
>>> Hi Daniel,
>>>
>>> That sounds fine to do (the easier solution; my brain's gotten complex
>>> today, ha).
>>>
>>> We should classify the two types of error in the docs for users the
>>> way you have here, to indicate what the issue is in each of the error
>>> cases - UnknownScannerException and LeaseException. Mind filing a
>>> JIRA? :)
>>>
>>> On Thu, Sep 20, 2012 at 7:21 PM, Daniel Iancu <daniel.iancu@1and1.ro>
>>> wrote:
>>>> Thaaank you! I was waiting for this email for months. I've read all the
>>>> posts regarding lease timeouts and see that people usually hit them for two
>>>> reasons. One is the normal case where the client app does not process the
>>>> rows fast enough, so they get UnknownScannerException; others had the issue
>>>> below and get LeaseException instead.
>>>>
>>>> How about using a try/catch around
>>>>
>>>>         // Remove lease while its being processed in server; protects against case
>>>>         // where processing of request takes > lease expiration time.
>>>>         lease = this.leases.removeLease(scannerName);
>>>>
>>>> and re-throwing an IllegalStateException, or logging a warning message,
>>>> because a client with an active scanner but no lease does not seem to be
>>>> in the right state?
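>>>>
>>>> Roughly something like this, just a sketch of the idea (the exception
>>>> handling, message and LOG usage are made up here, not actual HBase code):
>>>>
>>>>         Lease lease = null;
>>>>         try {
>>>>           lease = this.leases.removeLease(scannerName);
>>>>         } catch (LeaseException le) {
>>>>           // The scanner exists but its lease is already checked out,
>>>>           // e.g. by another handler still working on a previous next().
>>>>           LOG.warn("Scanner " + scannerName
>>>>               + " is active but has no available lease", le);
>>>>           throw new IllegalStateException("Active scanner " + scannerName
>>>>               + " has no lease; is a previous next() still running?", le);
>>>>         }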
>>>>
>>>> Just an idea, but you know better.
>>>> Daniel
>>>>
>>>> On 09/20/2012 03:42 PM, Harsh J wrote:
>>>>
>>>> Hi,
>>>>
>>>> I hit this today and got down to investigating it, and one of my
>>>> colleagues discovered this thread. Since I have some more clues, I
>>>> thought I'd bump this thread for good.
>>>>
>>>> Lucian almost got the issue here. The thing we missed thinking about
>>>> is the client retry. The HBaseRPC client seems to silently retry on
>>>> timeouts. So if you take Lucian's theory below and add that a client
>>>> retry calls next(ID, Rows) yet again, you can construct this issue:
>>>>
>>>> - The client calls next(ID, Rows) the first time.
>>>> - The RS handler receives the request, removes the lease (so it does not
>>>> expire during the next() call) and begins work.
>>>> - RS#next hangs during the work (for whatever reason we can assume - large
>>>> values, locks, or whatever).
>>>> - The client times out after a minute and retries (its default behavior);
>>>> the retry seems to be silent, though.
>>>> - A new next(ID, Rows) call is invoked. The scanner still exists, so no
>>>> UnknownScannerException is thrown. But when next() tries to remove the
>>>> lease, a LeaseException is thrown (the client gets it immediately and
>>>> dies), as the other, parallel handler has already removed the lease object
>>>> and is holding it in its stuck state.
>>>> - A few seconds/minutes later, the original next() unfreezes, adds the
>>>> lease back to the queue, tries to write back its response, and runs into
>>>> ClosedChannelException as the client has already thrown its original
>>>> socket away. End of the client.
>>>> - One lease period later, the lease is formally removed without any
>>>> hitches.
>>>>
>>>> Ideally, to prevent this, the rpc.timeout must be > the lease period, as
>>>> was pointed out: in that case we'd have waited that much longer for the
>>>> original next() to unblock and continue itself, and would not have
>>>> retried. That is how this is avoided, unintentionally, but it can still
>>>> happen if the next() takes very long.
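>>>>
>>>> For example, in hbase-site.xml (the value here is only illustrative; the
>>>> point is to keep it above hbase.regionserver.lease.period):
>>>>
>>>> <property>
>>>>   <name>hbase.rpc.timeout</name>
>>>>   <!-- Example value: 5 minutes, kept above the scanner lease period so
>>>>        the client waits out a slow next() instead of retrying while the
>>>>        first call still holds the lease. -->
>>>>   <value>300000</value>
>>>> </property>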
>>>>
>>>> I haven't seen a LeaseException in any other case so far, so maybe we
>>>> can improve that exception's message to indicate what's going on in
>>>> simpler terms, so clients can reconfigure to fix themselves?
>>>>
>>>> Also, we could add some measures to prevent next()-duping, as that is
>>>> never going to work given the lease-required system. Perhaps when next()
>>>> stores the removed lease, we could store it somewhere global (like an
>>>> ActiveLeases set or summat) and deny next() duping if the requested lease
>>>> is already in ActiveLeases? That just ends up giving a better message,
>>>> though, not a solution.
>>>>
>>>> Hope this helps others who've run into the same issue.
>>>>
>>>> On Mon, Oct 24, 2011 at 10:52 PM, Jean-Daniel Cryans
>>>> <jdcryans@apache.org> wrote:
>>>>
>>>> So you should see the SocketTimeoutException in your *client* logs (in
>>>> your case, the mappers), not LeaseException. At this point, yes, you're
>>>> going to time out, but if you spend so much time cycling on the server
>>>> side then you shouldn't set a high caching value on your scanner, as IO
>>>> isn't your bottleneck.
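>>>>
>>>> For example, on the scan the job uses (values are illustrative only):
>>>>
>>>>         Scan scan = new Scan();
>>>>         // scan.setFilter(...);      // whatever filters the job needs
>>>>         scan.setCaching(10);         // rows per next() RPC, instead of 1000
>>>>         scan.setCacheBlocks(false);  // usual advice for full-table MR scans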
>>>>
>>>> J-D
>>>>
>>>> On Mon, Oct 24, 2011 at 10:15 AM, Lucian Iordache
>>>> <lucian.george.iordache@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> The servers have been restarted (I have had this configuration for more
>>>> than a month, so this is not the problem).
>>>> About the stack traces, they show exactly the same thing: a lot of
>>>> ClosedChannelExceptions and LeaseExceptions.
>>>>
>>>> But I found something that could be the problem: hbase.rpc.timeout. This
>>>> defaults to 60 seconds, and I did not modify it in hbase-site.xml. So it
>>>> could happen the following way:
>>>> - the mapper makes a scanner.next() call to the region server
>>>> - the region server needs more than 60 seconds to execute it (I use
>>>> multiple filters, and it could take a lot of time)
>>>> - the scan client gets the timeout and cuts the connection
>>>> - the region server tries to send the results to the client ==>
>>>> ClosedChannelException
>>>>
>>>> I will get a deeper look into it tomorrow. If you have other suggestions,
>>>> please let me know!
>>>>
>>>> Thanks,
>>>> Lucian
>>>>
>>>> On Mon, Oct 24, 2011 at 8:00 PM, Jean-Daniel Cryans
>>>> <jdcryans@apache.org> wrote:
>>>>
>>>> Did you restart the region servers after changing the config?
>>>>
>>>> Are you sure it's the same exception/stack trace?
>>>>
>>>> J-D
>>>>
>>>> On Mon, Oct 24, 2011 at 8:04 AM, Lucian Iordache
>>>> <lucian.george.iordache@gmail.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I have exactly the same problem that Eran had.
>>>> But there is something I don't understand: in my case, I have set the lease
>>>> time to 240000 (4 minutes). But most of the map tasks that are failing run
>>>> about 2 minutes. How is it possible to get a LeaseException if the task runs
>>>> less than the configured time for a lease?
>>>>
>>>> Regards,
>>>> Lucian Iordache
>>>>
>>>> On Fri, Oct 21, 2011 at 12:34 AM, Eran Kutner <eran@gigya.com> wrote:
>>>>
>>>> Perfect! Thanks.
>>>>
>>>> -eran
>>>>
>>>>
>>>>
>>>> On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans <jdcryans@apache.org>
>>>> wrote:
>>>>
>>>> hbase.regionserver.lease.period
>>>>
>>>> Set it bigger than 60000.
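>>>>
>>>> For example, in hbase-site.xml (the value is only an example):
>>>>
>>>> <property>
>>>>   <name>hbase.regionserver.lease.period</name>
>>>>   <!-- Scanner lease period in ms; the default is 60000 (1 minute). -->
>>>>   <value>240000</value>
>>>> </property>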
>>>>
>>>> J-D
>>>>
>>>> On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <eran@gigya.com> wrote:
>>>>
>>>> Thanks J-D!
>>>> Since my main table is expected to continue growing, I guess at some point
>>>> even setting the cache size to 1 will not be enough. Is there a way to
>>>> configure the lease timeout?
>>>>
>>>> -eran
>>>>
>>>>
>>>>
>>>> On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans <jdcryans@apache.org>
>>>> wrote:
>>>>
>>>> On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <eran@gigya.com> wrote:
>>>>
>>>> Hi J-D,
>>>> Thanks for the detailed explanation.
>>>> So if I understand correctly, the lease we're talking about is a scanner
>>>> lease and the timeout is between two scanner calls, correct? I think that
>>>> makes sense, because I now realize that the jobs that fail (some jobs
>>>> continued to fail even after reducing the number of map tasks as Stack
>>>> suggested) use filters to fetch relatively few rows out of a very large
>>>> table, so they could be spending a lot of time on the region server
>>>> scanning rows until they reached my setCaching value, which was 1000.
>>>> Setting the caching value to 1 seems to allow these jobs to complete.
>>>> I think it has to be the above, since my rows are small, with just a few
>>>> columns, and processing them is very quick.
>>>>
>>>> Excellent!
>>>>
>>>>
>>>> However, there are still a couple of things I don't understand:
>>>> 1. What is the difference between setCaching and setBatch?
>>>>
>>>> * Set the maximum number of values to return for each call to next()
>>>>
>>>> VS
>>>>
>>>> * Set the number of rows for caching that will be passed to scanners.
>>>>
>>>> The former is useful if you have rows with millions of columns and you
>>>> could setBatch to get only 1000 of them at a time. You could call that
>>>> intra-row scanning.
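>>>>
>>>> In code, the two knobs side by side (values illustrative only):
>>>>
>>>>         Scan scan = new Scan();
>>>>         scan.setCaching(100); // rows fetched and buffered per next() RPC
>>>>         scan.setBatch(1000);  // max values (columns) returned per Result;
>>>>                               // useful for very wide rows ("intra-row scanning")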
>>>>
>>>>
>>>> 2. Examining the region server logs more closely than I did yesterday, I
>>>> see a lot of ClosedChannelExceptions in addition to the expired leases
>>>> (but no UnknownScannerException), is that expected? You can see an excerpt
>>>> of the log from one of the region servers here:
>>>>
>>>> http://pastebin.com/NLcZTzsY
>>>>
>>>> It means that when the server got to process that client request and
>>>> started reading from the socket, the client was already gone. Killing a
>>>> client does that (or killing a MR that scans), so does
>>>> SocketTimeoutException. This should probably go in the book. We should
>>>> also print something nicer :)
>>>>
>>>> J-D
>>>>
>>>>
>>>>
>>>> --
>>>> Harsh J
>>>>
>>>>
>>>
>>> --
>>> Harsh J
>>
>
>
> --
> Harsh J

