From: Ted Yu
To: user@hbase.apache.org
Date: Mon, 21 May 2012 09:46:06 -0700
Subject: Re: Yet Another LeaseException :-(

Thanks for the analysis.

It shouldn't be difficult to verify your hypothesis. In the following code:

    } catch (Throwable t) {
      t = translateException(t);
      exceptions.add(t);

you can add a log statement that shows the type of t along with information about the callable. When the LeaseException happens again, it will be easier to correlate the logs.

Cheers

On Mon, May 21, 2012 at 7:34 AM, Igal Shilman wrote:

> Hi,
>
> We've noticed in our production cluster (0.90.4-cdh3u3) that from time to
> time some of our map tasks fail due to a LeaseException thrown while
> scanning.
>
> We have "hbase.regionserver.lease.period" and "hbase.rpc.timeout" both set
> to 5 minutes.
>
> What's strange about this is the sequence of events that causes the maps to
> fail:
> (Relevant log parts are here: http://pastebin.com/d1yckmz6)
>
> (a) A client calls next(69901879722105864, 100).
> (b) HRegionServer:next tries to call removeLease(69901879722105864) and a
> LeaseException is thrown (lease 69901879722105864 does not exist).
> (c) A few milliseconds later the mapper logs the same error and terminates
> immediately.
> (d) A minute later we see that RegionServer$Responder.doRespond fails
> because the stream is closed (our client died a minute earlier).
> (e) Five minutes later (= our lease period) the RegionServer's log shows:
> Scanner 69901879722105864 lease expired.
>
> Now that seems pretty odd, especially since (b) happened 5 minutes before
> (e).
>
> IMHO, this might be possible in the following scenario:
>
> 1. A ScannerCallable wishing to call next(69901879722105864, 100) is
> passed to getRegionServerWithRetries.
>
> 2. The RS accepts it, enters next(69901879722105864, 100), and removes the
> lease associated with "69901879722105864".
>
> 3. Meanwhile, getRegionServerWithRetries catches an exception that is not of
> type DoNotRetryIOException (perhaps a socket timeout?) while waiting for this
> callable to complete. getRegionServerWithRetries just silently adds it to a
> list of exceptions.
>
> 4. Then a retry causes (b), followed by a rethrow of the LeaseException
> (masking any previous exceptions that were accumulated in step 3).
>
> Does this scenario seem possible to anyone?
>
> Thanks,
> Igal.
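
To make the logging suggestion and the suspected masking concrete, here is a minimal, self-contained sketch. It is not the HBase client code: RetryLoggingSketch, callWithRetries, FakeScannerNext and FakeLeaseException are hypothetical names invented for illustration, and the "first call times out, retry finds no lease" behaviour is simply the hypothesis from the quoted message hard-coded into a fake callable.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

public class RetryLoggingSketch {

    /** Hypothetical stand-in for a non-retryable error such as LeaseException. */
    public static class FakeLeaseException extends IOException {
        public FakeLeaseException(String msg) { super(msg); }
    }

    /**
     * Hypothetical callable modelling the suspected scenario: the first call
     * times out on the client after the server has already removed the lease,
     * so every later call finds no lease.
     */
    public static class FakeScannerNext implements Callable<String> {
        private final long scannerId;
        private boolean leaseRemoved = false;

        public FakeScannerNext(long scannerId) { this.scannerId = scannerId; }

        @Override
        public String call() throws IOException {
            if (!leaseRemoved) {
                leaseRemoved = true;
                throw new SocketTimeoutException("timed out waiting for next()");
            }
            throw new FakeLeaseException("lease " + scannerId + " does not exist");
        }

        @Override
        public String toString() { return "next(" + scannerId + ", 100)"; }
    }

    /** Simplified retry loop (an assumption, not the real getRegionServerWithRetries). */
    public static <T> T callWithRetries(Callable<T> callable, int maxTries, List<String> log)
            throws IOException {
        List<Throwable> exceptions = new ArrayList<>();
        for (int attempt = 1; attempt <= maxTries; attempt++) {
            try {
                return callable.call();
            } catch (Throwable t) {
                // The suggested addition: record the exception type and the callable.
                log.add("attempt " + attempt + " of " + callable + " failed with "
                        + t.getClass().getName() + ": " + t.getMessage());
                exceptions.add(t);
                if (t instanceof FakeLeaseException) {
                    // Non-retryable: rethrown as-is, which hides the earlier timeout
                    // from the caller unless it was logged above.
                    throw (FakeLeaseException) t;
                }
            }
        }
        Throwable last = exceptions.isEmpty() ? null : exceptions.get(exceptions.size() - 1);
        throw new IOException("giving up after " + maxTries + " tries", last);
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        try {
            callWithRetries(new FakeScannerNext(69901879722105864L), 3, log);
        } catch (IOException e) {
            System.out.println("caller sees: " + e.getClass().getSimpleName());
        }
        log.forEach(System.out::println);
    }
}
```

In this sketch the caller only ever sees the lease error, while the log list also holds the earlier timeout. That earlier entry is the evidence that would confirm or rule out the scenario in a real cluster.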