Date: Wed, 20 Mar 2013 04:07:15 +0000 (UTC)
From: "ramkrishna.s.vasudevan (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-8150) the code that handles RAITE on master in 0.94 should not always use the same plan

    [ https://issues.apache.org/jira/browse/HBASE-8150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607254#comment-13607254 ]

ramkrishna.s.vasudevan commented on HBASE-8150:
-----------------------------------------------

Do you mean differentiating RegionAlreadyInTransitionException from normal operations through a successful RPC only? Currently we return one of two things, ALREADY_OPENED or FAILED_OPENING, depending on the condition. Do you mean adding another return type?

> the code that handles RAITE on master in 0.94 should not always use the same plan
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-8150
>                 URL: https://issues.apache.org/jira/browse/HBASE-8150
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Priority: Minor
>
> The code in 0.94 AM sets the region plan to point to the same server when retrying the assignment due to RAITE.
> {code}
> LOG.warn("Failed assignment of "
>     + state.getRegion().getRegionNameAsString()
>     + " to "
>     + plan.getDestination()
>     + ", trying to assign "
>     + (regionAlreadyInTransitionException ? "to the same region server"
>     + " because of RegionAlreadyInTransitionException;" : "elsewhere instead; ")
>     + "retry=" + i, t);
> {code}
> However, there is no wait time in the loop that retries the assignment. If the region is being marked as failed to open, which may take some time, the master can easily exhaust its retries in less than half a second, going to the same server every time and getting the same exception (unfortunately I no longer have logs); the region will then be stuck.
> Do you think this is worth fixing (for example, by not using the same server here after a few retries, or by adding timed backoff in such cases)?
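
For illustration only, here is a minimal sketch of the two options raised in the issue description: stop reusing the same server after a few retries, and wait with a capped backoff between retries. It is not the 0.94 AssignmentManager code. The class name, method names, and constants below are all hypothetical, and servers are plain strings rather than HBase types.

```java
import java.util.List;

// Hypothetical sketch; none of these names come from HBase.
final class RetryPlanSketch {
    // Assumed limits, chosen only for the example.
    static final int MAX_SAME_SERVER_RETRIES = 3;
    static final long BASE_SLEEP_MS = 100L;
    static final long MAX_SLEEP_MS = 5_000L;

    // Capped exponential backoff: 100 ms, 200 ms, 400 ms, ... up to MAX_SLEEP_MS.
    static long backoffMillis(int retry) {
        int shift = Math.max(0, Math.min(retry, 20)); // clamp to keep the shift safe
        return Math.min(MAX_SLEEP_MS, BASE_SLEEP_MS << shift);
    }

    // Keep the current destination only for the first few RAITE retries.
    static boolean reuseSamePlan(boolean regionAlreadyInTransition, int retry) {
        return regionAlreadyInTransition && retry < MAX_SAME_SERVER_RETRIES;
    }

    // Pick the destination for this retry: the current server while reuse is
    // allowed, otherwise the first other server in the list (if there is one).
    static String chooseDestination(String current, List<String> servers,
                                    boolean regionAlreadyInTransition, int retry) {
        if (reuseSamePlan(regionAlreadyInTransition, retry)) {
            return current;
        }
        for (String server : servers) {
            if (!server.equals(current)) {
                return server;
            }
        }
        return current; // no alternative available
    }
}
```

A retry loop built on this would sleep for backoffMillis(i) before attempt i and call chooseDestination to decide whether to keep the plan. Whether the limit or the backoff (or both) is the right fix is exactly the open question in the issue.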