From: Igal Shilman
Date: Mon, 21 May 2012 17:34:51 +0300
Subject: Yet Another LeaseException :-(
To: hbase-user@hadoop.apache.org

Hi,

We've noticed in our production cluster (0.90.4-cdh3u3) that, from time to time, some of our map tasks fail due to a LeaseException thrown while scanning. We have "hbase.regionserver.lease.period" and "hbase.rpc.timeout" both set to 5 minutes.

What's strange is the sequence of events that causes the maps to fail (relevant log parts are here: http://pastebin.com/d1yckmz6):

(a) A client calls next(69901879722105864, 100).
(b) HRegionServer:next tries to call removeLease(69901879722105864) and a LeaseException is thrown (lease 69901879722105864 does not exist).
(c) A few milliseconds later the mapper logs the same error and terminates immediately.
(d) A minute later we see that RegionServer$Responder.doRespond fails because the stream is closed (our client died a minute earlier).
(e) Five minutes later (= our lease period) the RegionServer's log shows: Scanner 69901879722105864 lease expired.

That seems pretty odd, especially since (b) happened 5 minutes before (e).

IMHO this might be possible in the following scenario:

1. A ScannerCallable wishing to call next(69901879722105864, 100) is passed to getRegionServerWithRetries.
2. The RS accepts it, enters next(69901879722105864, 100), and removes the lease associated with "69901879722105864".
3. Meanwhile, getRegionServerWithRetries catches an exception that is not of type DoNotRetryIOException (perhaps a socket timeout?) while waiting for this callable to complete. getRegionServerWithRetries just silently adds it to a list of exceptions.
4. A retry then causes (b), followed by a rethrow of the LeaseException (masking any previous exceptions that were accumulated in step 3).

Does this scenario seem possible to anyone?

Thanks,
Igal.
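
For anyone who wants to reason about steps 1-4 concretely, here is a minimal toy model of the suspected sequence. It is only a sketch and not HBase code: ToyRegionServer, DoNotRetryException, LeaseMissingException, scannerCallable and callWithRetries are hypothetical stand-ins, and the client-side timeout is simulated with a flag rather than a real socket. It assumes the retry loop rethrows non-retryable errors immediately and only records the retryable ones.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LeaseMaskingSketch {

    /** Hypothetical stand-in for a "do not retry" error (not the HBase class). */
    static class DoNotRetryException extends IOException {
        DoNotRetryException(String msg) { super(msg); }
    }

    /** Hypothetical stand-in for the lease error seen in step (b). */
    static class LeaseMissingException extends DoNotRetryException {
        LeaseMissingException(long id) { super("lease '" + id + "' does not exist"); }
    }

    /** Toy server: next() removes the lease while it is working on the call. */
    static class ToyRegionServer {
        final Set<Long> leases = new HashSet<>();
        boolean stallNextCall = false;

        /** Returns false if the call did not answer in time (simulated stall). */
        boolean next(long scannerId) throws IOException {
            if (!leases.remove(scannerId)) {
                throw new LeaseMissingException(scannerId);
            }
            if (stallNextCall) {
                stallNextCall = false;
                return false; // still "inside" next(): the lease has not been restored yet
            }
            leases.add(scannerId); // normal path: lease is restored after the call
            return true;
        }
    }

    /** Client-side call: a slow answer surfaces as a retryable timeout. */
    static void scannerCallable(ToyRegionServer rs, long scannerId) throws IOException {
        if (!rs.next(scannerId)) {
            throw new SocketTimeoutException("simulated client-side timeout");
        }
    }

    /** Retry loop: retryable errors are only recorded, non-retryable ones are rethrown. */
    static void callWithRetries(ToyRegionServer rs, long scannerId, int maxTries,
                                List<Throwable> swallowed) throws IOException {
        for (int attempt = 0; attempt < maxTries; attempt++) {
            try {
                scannerCallable(rs, scannerId);
                return;
            } catch (DoNotRetryException e) {
                throw e; // the caller sees only this, not what was swallowed earlier
            } catch (IOException e) {
                swallowed.add(e); // silently accumulated, as in step 3
            }
        }
        throw new IOException("retries exhausted: " + swallowed);
    }

    /** Runs the scenario and reports which exception the caller ends up seeing. */
    static String simulate() {
        ToyRegionServer rs = new ToyRegionServer();
        long scannerId = 69901879722105864L;
        rs.leases.add(scannerId);
        rs.stallNextCall = true;
        List<Throwable> swallowed = new ArrayList<>();
        try {
            callWithRetries(rs, scannerId, 3, swallowed);
            return "ok";
        } catch (IOException e) {
            return e.getClass().getSimpleName() + ", swallowed=" + swallowed.size();
        }
    }

    public static void main(String[] args) {
        System.out.println(simulate());
    }
}
```

In this model the caller sees only the lease error, while the timeout that actually started the problem sits unreported in the swallowed list. Whether the real client and region server interleave this way is exactly the open question in the mail above.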