Date: Tue, 4 May 2010 00:12:16 +0200
Subject: hbase.client.retries.number = 1 is bad
From: Miklós Kurucz
To: hbase-user@hadoop.apache.org

Hi!

I'm using a fresh version of trunk. I'm experiencing a problem where invalid region locations are not removed from the cache of HCM (HConnectionManager). I only use scanners on the table, and I receive the following errors:

2010-05-03 23:42:52,574 DEBUG org.apache.hadoop.hbase.client.HTable$ClientScanner: Advancing internal scanner to startKey at 'http://hu.gaabi.www/jordania/\x28041022\x29_jord-155_petra.jpg'
2010-05-03 23:42:52,574 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for row in tableName Test5: location server 10.1.3.111:60020, location region name Test5,http://hu.gaabi.www/jordania/\x28041022\x29_jord-155_petra.jpg,1272896369136
SEVERE: Trying to contact region server 10.1.3.111:60020 for region Test5,http://hu.gaabi.www/jordania/\x28041022\x29_jord-155_petra.jpg,1272896369136, row 'http://hu.gaabi.www/jordania/\x28041022\x29_jord-155_petra.jpg', but failed after 1 attempts.
Exceptions:
java.net.ConnectException: Connection refused

This is expected, as the regionserver at 10.1.3.111:60020 had been offline for hours at that time.

The cause of the problem is that I set hbase.client.retries.number to 1, as I don't like the current retry options. In this case the following code at HConnectionManager.java:1061

    callable.instantiateServer(tries != 0);

makes scanners always use the cache: with a single attempt, tries is only ever 0, so the argument is always false and the cached location is never refreshed. This makes hbase.client.retries.number = 1 an unusable option.

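
To make the effect of that line concrete, here is a minimal sketch of how the `tries != 0` flag behaves for different retry counts. It is a simplified model, not the HBase source: the class `RetryCacheSketch`, the method `reloadFlags`, and the loop shape `for (tries = 0; tries < numRetries; tries++)` are assumptions made for illustration; only the expression `tries != 0` comes from the quoted line.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of a client retry loop. NOT the actual HBase code:
// the class, method, and loop bounds are hypothetical.
class RetryCacheSketch {

    // Returns the flag that would be passed on each attempt,
    // assuming every attempt fails and the loop runs to the end.
    static List<Boolean> reloadFlags(int numRetries) {
        List<Boolean> flags = new ArrayList<>();
        for (int tries = 0; tries < numRetries; tries++) {
            // Same expression as in the quoted line: false on the first
            // attempt (use the cache), true on every later attempt.
            boolean reload = (tries != 0);
            flags.add(reload);
            // Assume the call fails here (e.g. ConnectException) and we loop.
        }
        return flags;
    }

    public static void main(String[] args) {
        System.out.println("retries=1 -> " + reloadFlags(1)); // prints "retries=1 -> [false]"
        System.out.println("retries=3 -> " + reloadFlags(3)); // prints "retries=3 -> [false, true, true]"
    }
}
```

Under these assumptions, a retry count of 1 never reaches an attempt where the flag is true, so a stale cached location would be reused on every new call.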
This is not intentional, am I correct? Am I forced to use the retries, or is there another option?

I would also like to ask: when is it a good thing to retry an operation? In my experience there are two kinds of failures:

1) org.apache.hadoop.hbase.NotServingRegionException: region is offline
   This can be due to a compaction, in which case we probably need to wait a few seconds, or due to a split, in which case we might need to wait for minutes. In either case I would not want my client to wait that long when I could schedule other things to do in that time. It is also possible that the region has been transferred to another regionserver, but that is rare compared to the other cases.

2) java.net.ConnectException: regionserver is offline
   This is resolved as soon as the master can reopen the regions on another regionserver, but that can still take minutes. In any case, this exception is also (usually) rare.

Best regards,
Miklos