From: Harsh J
Date: Thu, 20 Sep 2012 20:45:37 +0530
Subject: Re: Lease does not exist exceptions
To: user@hbase.apache.org
Cc: Daniel Iancu

Hi Igal,

I seem to have missed that mail in my search. Thanks for pointing it
out - you are right on there. I commented on the JIRA; it is a nice
improvement.

On Thu, Sep 20, 2012 at 7:52 PM, Igal Shilman wrote:
> Hi,
> Do you mind taking a look at HBASE-6071?
>
> It was submitted as a result of this mail (back in May):
> http://mail-archives.apache.org/mod_mbox/hbase-user/201205.mbox/%3CCAFebPXBq9V9BVdzRTNr-MB3a1Lz78SZj6gvP6On0b%2Bajt9StAg%40mail.gmail.com%3E
>
> I've recently submitted logs that (I think) confirm this theory.
>
> Thanks,
> Igal.
>
> On Thu, Sep 20, 2012 at 4:55 PM, Harsh J wrote:
>
>> Hi Daniel,
>>
>> That sounds fine to do (the easier solution; my brain's gotten complex
>> today, ha).
>>
>> We should classify the two types of error in the docs for users, the
>> way you have here, to indicate what the issue is in each of the error
>> cases - UnknownScannerException and LeaseException. Mind filing a
>> JIRA? :)
>>
>> On Thu, Sep 20, 2012 at 7:21 PM, Daniel Iancu wrote:
>>
>>> Thaaank you! I was waiting for this email for months. I've read all
>>> the posts regarding lease timeouts and see that people usually hit
>>> them for two reasons. One is the normal case, where the client app
>>> does not process the rows fast enough and gets
>>> UnknownScannerException; others had the issue below and get
>>> LeaseException instead.
>>>
>>> How about using a try/catch around the
>>>
>>>   // Remove lease while its being processed in server; protects against case
>>>   // where processing of request takes > lease expiration time.
>>>   lease = this.leases.removeLease(scannerName);
>>>
>>> and re-throwing an IllegalStateException, or logging a warning
>>> message? A client with an active scanner but no lease does not seem
>>> to be in the right state.
>>>
>>> Just an idea, but you know better.
>>> Daniel
>>>
>>> On 09/20/2012 03:42 PM, Harsh J wrote:
>>>
>>> Hi,
>>>
>>> I hit this today and got down to investigating it, and one of my
>>> colleagues discovered this thread. Since I have some more clues, I
>>> thought I'd bump this thread up for good.
>>>
>>> Lucian almost got the issue here. The thing we missed thinking about
>>> is the client retry. The client side of HBaseRPC seems to silently
>>> retry on timeouts. So if you take Lucian's theory below and add a
>>> client retry that calls next(ID, Rows) yet again, you can construct
>>> this issue:
>>>
>>> - Client calls next(ID, Rows) for the first time.
>>> - The RS receives the handler-sent request, removes the lease (so it
>>>   does not expire during the next() call) and begins work.
>>> - RS#next hangs during the work (for whatever reason we can assume -
>>>   large values, locks, or whatever).
>>> - The client times out after a minute and retries (its default
>>>   behaviour). The retry seems to be silent, though.
>>> - A new next(ID, Rows) call is invoked. The scanner still exists, so
>>>   no UnknownScannerException is thrown. But when this next() tries to
>>>   remove the lease, a LeaseException is thrown (which the client gets
>>>   immediately, and dies), as the other, parallel handler has already
>>>   removed the lease object and holds it in its stuck state.
>>> - A few seconds/minutes later, the original next() unfreezes, adds
>>>   the lease back to the queue, tries to write back its response, and
>>>   runs into a ClosedChannelException, as the client has already
>>>   thrown its original socket away.
>>>   End of client.
>>> - A lease-period's expiry later, the lease is formally removed
>>>   without any hitches.
>>>
>>> Ideally, to prevent this, the rpc.timeout must be > the lease period,
>>> as was pointed out: in that case we'd have waited X units more for
>>> the original next() to unblock and continue on its own, and would not
>>> have retried. That is how this is avoided, unintentionally, but it
>>> can still happen if the next() takes very long.
>>>
>>> I haven't seen a LeaseException in any other case so far, so maybe we
>>> can improve that exception's message to indicate what's going on in
>>> simpler terms, so clients can reconfigure to fix themselves?
>>>
>>> Also, we could add some measures to prevent next()-duping, as that is
>>> never going to work given the lease-required system. Perhaps when
>>> next() stores the removed lease, we could store it somewhere global
>>> (like an ActiveLeases set or some such) and deny a duplicate next()
>>> if its requested lease is already in ActiveLeases? That just ends up
>>> giving a better message, though, not a solution.
>>>
>>> Hope this helps others who've run into the same issue.
>>>
>>> On Mon, Oct 24, 2011 at 10:52 PM, Jean-Daniel Cryans wrote:
>>>
>>> So you should see the SocketTimeoutException in your *client* logs
>>> (in your case, the mappers), not a LeaseException. At that point,
>>> yes, you are going to time out; but if you spend so much time cycling
>>> on the server side, then you shouldn't set a high caching
>>> configuration on your scanner, as IO isn't your bottleneck.
>>>
>>> J-D
>>>
>>> On Mon, Oct 24, 2011 at 10:15 AM, Lucian Iordache wrote:
>>>
>>> Hi,
>>>
>>> The servers have been restarted (I have had this configuration for
>>> more than a month, so that is not the problem). As for the stack
>>> traces, they show exactly the same thing: a lot of
>>> ClosedChannelExceptions and LeaseExceptions.
>>>
>>> But I found something that could be the problem: hbase.rpc.timeout.
>>> This defaults to 60 seconds, and I did not modify it in
>>> hbase-site.xml. So it could happen the following way:
>>> - the mapper makes a scanner.next call to the region server
>>> - the region server needs more than 60 seconds to execute it (I use
>>>   multiple filters, and it can take a lot of time)
>>> - the scan client hits the timeout and cuts the connection
>>> - the region server tries to send the results to the client ==>
>>>   ClosedChannelException
>>>
>>> I will take a deeper look into it tomorrow. If you have other
>>> suggestions, please let me know!
>>>
>>> Thanks,
>>> Lucian
>>>
>>> On Mon, Oct 24, 2011 at 8:00 PM, Jean-Daniel Cryans wrote:
>>>
>>> Did you restart the region servers after changing the config?
>>>
>>> Are you sure it's the same exception/stack trace?
>>>
>>> J-D
>>>
>>> On Mon, Oct 24, 2011 at 8:04 AM, Lucian Iordache wrote:
>>>
>>> Hi all,
>>>
>>> I have exactly the same problem that Eran had. But there is something
>>> I don't understand: in my case, I have set the lease time to 240000
>>> (4 minutes), yet most of the map tasks that are failing run for about
>>> 2 minutes. How is it possible to get a LeaseException if the task
>>> runs for less than the configured lease time?
>>>
>>> Regards,
>>> Lucian Iordache
>>>
>>> On Fri, Oct 21, 2011 at 12:34 AM, Eran Kutner wrote:
>>>
>>> Perfect! Thanks.
>>>
>>> -eran
>>>
>>> On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans wrote:
>>>
>>> hbase.regionserver.lease.period
>>>
>>> Set it bigger than 60000.
>>>
>>> J-D
>>>
>>> On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner wrote:
>>>
>>> Thanks J-D!
>>> Since my main table is expected to continue growing, I guess at some
>>> point even setting the cache size to 1 will not be enough. Is there a
>>> way to configure the lease timeout?
>>>
>>> -eran
>>>
>>> On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans
>>> <jdcryans@apache.org> wrote:
>>>
>>> On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner wrote:
>>>
>>> Hi J-D,
>>> Thanks for the detailed explanation. So if I understand correctly,
>>> the lease we're talking about is a scanner lease, and the timeout
>>> applies between two scanner calls, correct? I think that makes sense,
>>> because I now realize that the jobs that fail (some jobs continued to
>>> fail even after reducing the number of map tasks, as Stack suggested)
>>> use filters to fetch relatively few rows out of a very large table,
>>> so they could be spending a lot of time on the region server scanning
>>> rows until they reach my setCaching value, which was 1000. Setting
>>> the caching value to 1 seems to allow these jobs to complete. I think
>>> it has to be the above, since my rows are small, with just a few
>>> columns, and processing them is very quick.
>>>
>>> Excellent!
>>>
>>> However, there are still a couple of things I don't understand:
>>> 1. What is the difference between setCaching and setBatch?
>>>
>>> * Set the maximum number of values to return for each call to next()
>>> VS
>>> * Set the number of rows for caching that will be passed to scanners.
>>>
>>> The former is useful if you have rows with millions of columns: you
>>> could setBatch to get only 1000 of them at a time. You could call
>>> that intra-row scanning.
>>>
>>> 2.
>>> Examining the region server logs more closely than I did yesterday,
>>> I see a lot of ClosedChannelExceptions in addition to the expired
>>> leases (but no UnknownScannerException). Is that expected? You can
>>> see an excerpt of the log from one of the region servers here:
>>> http://pastebin.com/NLcZTzsY
>>>
>>> It means that when the server got around to processing that client
>>> request and started reading from the socket, the client was already
>>> gone. Killing a client does that (so does killing a MR job that
>>> scans), and so does a SocketTimeoutException. This should probably go
>>> in the book. We should also print something nicer :)
>>>
>>> J-D
>>>
>>> --
>>> Harsh J
>>
>> --
>> Harsh J

--
Harsh J
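
The retry race described in this thread boils down to two handlers contending for one scanner lease: the first next() removes the lease and gets stuck, and the silent client retry then finds no lease to remove. The following self-contained Java sketch models only that bookkeeping; the Leases/LeaseException names loosely mirror HBase's, but this is a simplified stand-in, not the real RegionServer code.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the RegionServer's scanner-lease bookkeeping.
// A lease is removed while next() runs (so it cannot expire mid-call)
// and added back afterwards; a retried next() for the same scanner
// therefore finds no lease and fails immediately.
public class LeaseRaceSketch {
    static class LeaseException extends RuntimeException {
        LeaseException(String msg) { super(msg); }
    }

    static final Map<String, Long> leases = new HashMap<>();

    // Mirrors Leases#removeLease: throws if the lease is absent.
    static long removeLease(String scannerName) {
        Long lease = leases.remove(scannerName);
        if (lease == null) {
            throw new LeaseException("lease '" + scannerName + "' does not exist");
        }
        return lease;
    }

    public static void main(String[] args) {
        leases.put("scanner-1", 60000L);

        // First next() call: the handler takes the lease and then hangs.
        long held = removeLease("scanner-1");

        // Client times out and silently retries next() while the first
        // call is still stuck; the retry hits LeaseException at once.
        boolean retryFailed = false;
        try {
            removeLease("scanner-1");
        } catch (LeaseException e) {
            retryFailed = true;
        }

        // The original call finally unblocks and adds the lease back,
        // but the client socket is long gone by now (the
        // ClosedChannelException seen in the thread).
        leases.put("scanner-1", held);
        System.out.println("retry failed with LeaseException: " + retryFailed);
    }
}
```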
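
Putting the two settings from the thread side by side, the advice was to keep hbase.rpc.timeout above hbase.regionserver.lease.period, so a slow next() lets the lease expire before the client gives up and retries. An illustrative hbase-site.xml fragment; the 4-minute lease value is the one Lucian mentioned, and the 5-minute timeout is just an example value chosen to satisfy the inequality, not a recommendation:

```xml
<!-- Keep the client RPC timeout above the scanner lease period. -->
<property>
  <name>hbase.regionserver.lease.period</name>
  <value>240000</value> <!-- 4 minutes, as used in the thread -->
</property>
<property>
  <name>hbase.rpc.timeout</name>
  <value>300000</value> <!-- 5 minutes: > lease period, per the thread -->
</property>
```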
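
J-D's setCaching vs. setBatch distinction can be modelled without HBase at all: caching controls how many rows each RPC fetches, while batch splits one wide row's columns across successive Results (the "intra-row scanning" above). A hypothetical, simplified sketch of just the batch arithmetic; resultsFor is an invented helper, not an HBase API:

```java
// Simplified model of scanner chunking: 'batch' limits how many column
// values of a single row come back in one Result, so a wide row is
// split into ceil(colsPerRow / batch) Results.
public class ScanChunking {
    // Total Results produced for a table of rows x colsPerRow cells.
    static int resultsFor(int rows, int colsPerRow, int batch) {
        return rows * ((colsPerRow + batch - 1) / batch);
    }

    public static void main(String[] args) {
        // 3 rows of 5 columns, batch = 2: each row splits into 3 Results.
        System.out.println(resultsFor(3, 5, 2)); // 9
        // batch >= row width: one Result per row, as if unbatched.
        System.out.println(resultsFor(3, 5, 5)); // 3
    }
}
```

This is why a row with millions of columns benefits from setBatch: each next() returns a bounded chunk instead of materializing the whole row at once.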