Return-Path:
Delivered-To: apmail-hadoop-hbase-dev-archive@minotaur.apache.org
Received: (qmail 455 invoked from network); 15 Sep 2009 22:44:25 -0000
Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3)
    by minotaur.apache.org with SMTP; 15 Sep 2009 22:44:25 -0000
Received: (qmail 17245 invoked by uid 500); 15 Sep 2009 22:44:25 -0000
Delivered-To: apmail-hadoop-hbase-dev-archive@hadoop.apache.org
Received: (qmail 17228 invoked by uid 500); 15 Sep 2009 22:44:25 -0000
Mailing-List: contact hbase-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: hbase-dev@hadoop.apache.org
Delivered-To: mailing list hbase-dev@hadoop.apache.org
Received: (qmail 17199 invoked by uid 99); 15 Sep 2009 22:44:25 -0000
Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230)
    by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 15 Sep 2009 22:44:25 +0000
X-ASF-Spam-Status: No, hits=-2000.0 required=10.0 tests=ALL_TRUSTED
X-Spam-Check-By: apache.org
Received: from [140.211.11.140] (HELO brutus.apache.org) (140.211.11.140)
    by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 15 Sep 2009 22:44:19 +0000
Received: from brutus (localhost [127.0.0.1])
    by brutus.apache.org (Postfix) with ESMTP id DE0C1234C4B1
    for ; Tue, 15 Sep 2009 15:43:57 -0700 (PDT)
Message-ID: <2096629838.1253054637908.JavaMail.jira@brutus>
Date: Tue, 15 Sep 2009 15:43:57 -0700 (PDT)
From: "stack (JIRA)"
To: hbase-dev@hadoop.apache.org
Subject: [jira] Commented: (HBASE-1815) HBaseClient can get stuck in an infinite loop while attempting to contact a failed regionserver
In-Reply-To: <676212266.1252016817456.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394
X-Virus-Checked: Checked by ClamAV on apache.org

    [
https://issues.apache.org/jira/browse/HBASE-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755760#action_12755760 ]

stack commented on HBASE-1815:
------------------------------

HBaseClient also has this issue. From the list:

Yeah, this is down in the guts of the Hadoop RPC we use. Around connection setup it has its own config that is not well aligned with ours (ours being the retries and pause settings).

The max retries down in ipc is:

    this.maxRetries = conf.getInt("ipc.client.connect.max.retries", 10);

That's for an IOE other than a timeout. For a timeout, it does this:

    } catch (SocketTimeoutException toe) {
      /* The max number of retries is 45,
       * which amounts to 20s*45 = 15 minutes retries.
       */
      handleConnectionFailure(timeoutFailures++, 45, toe);

Let me file an issue to address the above. The retries should be our retries, and in here it has a hardcoded 1000ms that should instead be our pause. Not hard to fix.

> HBaseClient can get stuck in an infinite loop while attempting to contact a failed regionserver
> -----------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1815
>                 URL: https://issues.apache.org/jira/browse/HBASE-1815
>             Project: Hadoop HBase
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 0.20.0
>         Environment: Ubuntu Linux (Linux 2.6.24-23-generic #1 SMP Wed Apr 1 21:43:24 UTC 2009 x86_64 GNU/Linux), java version "1.6.0_06", Java(TM) SE Runtime Environment (build 1.6.0_06-b02), Java HotSpot(TM) 64-Bit Server VM (build 10.0-b22, mixed mode)
>            Reporter: Justin Lynn
>             Fix For: 0.20.1
>
>         Attachments: thrift_server_log_excerpt, thrift_server_threaddump, thrift_server_threaddump_1
>
>
> While using the HBase Thrift server, if a regionserver goes down due to shutdown or failure, clients will time out because the Thrift server cannot contact the dead regionserver.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
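
To make the comment's arithmetic concrete, here is a minimal standalone sketch, not the Hadoop or HBase code itself. The class `ConnectRetrySketch` and its two methods are hypothetical names invented for illustration. It assumes each attempt can run to a fixed timeout (the 20s figure from the quoted source comment) and that the caller, rather than a hardcoded constant, supplies the retry count and the pause between attempts, which is roughly what the comment proposes.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

// Hypothetical illustration only; none of these names exist in Hadoop or HBase.
class ConnectRetrySketch {

    /**
     * Upper bound on time spent if every attempt runs to its timeout:
     * one timeout per attempt, plus one pause between consecutive attempts.
     */
    static long worstCaseMillis(int maxAttempts, long timeoutMillis, long pauseMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        return maxAttempts * timeoutMillis + (maxAttempts - 1) * pauseMillis;
    }

    /**
     * Runs the attempt up to maxRetries times, sleeping pauseMillis between
     * failures. Both values come from the caller instead of being hardcoded.
     * Any IOException (SocketTimeoutException included) counts as a failure;
     * the last one is rethrown once the budget is used up.
     */
    static <T> T callWithRetries(Callable<T> attempt, int maxRetries, long pauseMillis)
            throws Exception {
        if (maxRetries < 1) {
            throw new IllegalArgumentException("maxRetries must be >= 1");
        }
        IOException last = null;
        for (int tries = 0; tries < maxRetries; tries++) {
            try {
                return attempt.call();
            } catch (IOException e) {
                last = e;
                if (tries + 1 < maxRetries) {
                    Thread.sleep(pauseMillis); // no pause after the final failure
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // The figure from the quoted source comment: 45 attempts of 20s each.
        System.out.println(worstCaseMillis(45, 20_000L, 0L) / 60_000L + " minutes");

        // Simulated dead server: every attempt times out, the budget is 3 tries.
        try {
            callWithRetries(() -> {
                throw new SocketTimeoutException("simulated timeout");
            }, 3, 10L);
        } catch (SocketTimeoutException e) {
            System.out.println("gave up: " + e.getMessage());
        }
    }
}
```

With a single caller-supplied budget, the worst-case wait on a dead regionserver follows directly from the client's own retries and pause settings rather than from a separate limit buried in the RPC layer.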