Date: Fri, 13 Jun 2014 20:05:03 +0000 (UTC)
From: "Hudson (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-10871) Indefinite OPEN/CLOSE wait on busy RegionServers

    [ https://issues.apache.org/jira/browse/HBASE-10871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031088#comment-14031088 ]

Hudson commented on HBASE-10871:
--------------------------------

FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #316 (See [https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/316/])
HBASE-10871 Indefinite OPEN/CLOSE wait on busy RegionServers (Esteban) (jxiang: rev 7ffc454ccc64f095d8992f03edeb3aacd83de92e)
* hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java

> Indefinite OPEN/CLOSE wait on busy RegionServers
> ------------------------------------------------
>
>                 Key: HBASE-10871
>                 URL: https://issues.apache.org/jira/browse/HBASE-10871
>             Project: HBase
>          Issue Type: Improvement
>          Components: Balancer, master, Region Assignment
>   Affects Versions: 0.94.6
>            Reporter: Harsh J
>            Assignee: Esteban Gutierrez
>             Fix For: 0.99.0, 0.94.21, 0.98.4
>
>         Attachments: HBASE-10871-0.94.v1.patch, HBASE-10871.v0.patch, HBASE-10871.v1.patch
>
>
> We observed a case where a specific RS was bombarded by a large number of regular requests, which spiked and filled up its RPC queue. The unassigns and assigns that the balancer invoked for regions involving this server then entered an indefinite retry loop.
> Specifically, the regions began waiting in PENDING_CLOSE/PENDING_OPEN states indefinitely because the HBase Client RPC from the ServerManager at the master was running into SocketTimeouts. This made the affected regions on the server unavailable. The timeout monitor retry default of 30m in 0.94's AM widened the waiting gap further (this is now 10m in 0.95+'s new AM, which also has further retries before we get there, which is good).
> I wonder if there's a way to improve this situation generally. PENDING_OPENs may be easy to handle: we can switch them out and move them elsewhere. PENDING_CLOSEs may be a bit more tricky, but there should at least be a way to "give up" permanently on a movement plan and let things be for a while, hoping for the RS to recover on its own (so that clients also have a chance of getting things to work in the meantime).
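
The "give up on a movement plan" idea from the issue description could look roughly like the sketch below. This is not the committed HBASE-10871 patch and it does not use the real AssignmentManager or ServerManager API. The class BoundedRegionRpc, the RegionRpc interface, the Outcome enum, and the maxAttempts and backoffMillis parameters are all hypothetical names chosen for illustration. It only shows a retry loop that stops after a fixed number of SocketTimeouts, so that a caller could abandon the plan instead of leaving the region in PENDING_OPEN or PENDING_CLOSE indefinitely.

```java
import java.net.SocketTimeoutException;

/** Hypothetical sketch only; not HBase code and not the HBASE-10871 patch. */
public class BoundedRegionRpc {

    /** Result of trying to deliver an OPEN or CLOSE request to a RegionServer. */
    public enum Outcome { DELIVERED, GAVE_UP }

    /** Stand-in for the master-to-RegionServer call (assumed shape, not an HBase interface). */
    @FunctionalInterface
    public interface RegionRpc {
        void send() throws SocketTimeoutException;
    }

    /**
     * Tries the RPC at most maxAttempts times, retrying only on SocketTimeoutException.
     * Returns GAVE_UP when every attempt timed out, so the caller can drop the plan.
     */
    public static Outcome sendWithBoundedRetries(RegionRpc rpc, int maxAttempts, long backoffMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                rpc.send();
                return Outcome.DELIVERED;
            } catch (SocketTimeoutException e) {
                if (attempt == maxAttempts) {
                    break; // out of attempts: stop instead of looping forever
                }
                try {
                    // Linear backoff to give the busy server time to drain its RPC queue.
                    Thread.sleep(backoffMillis * attempt);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return Outcome.GAVE_UP;
                }
            }
        }
        return Outcome.GAVE_UP;
    }

    public static void main(String[] args) {
        // Simulated server that times out twice, then accepts the request.
        int[] calls = {0};
        Outcome recovered = sendWithBoundedRetries(() -> {
            if (++calls[0] < 3) {
                throw new SocketTimeoutException("simulated full RPC queue");
            }
        }, 5, 1L);
        System.out.println(recovered + " after " + calls[0] + " attempts");

        // Simulated server that never recovers within the attempt budget.
        Outcome stuck = sendWithBoundedRetries(() -> {
            throw new SocketTimeoutException("simulated full RPC queue");
        }, 3, 1L);
        System.out.println(stuck);
    }
}
```

On GAVE_UP, a caller following the description might pick another server for a region in PENDING_OPEN, and for PENDING_CLOSE simply leave the region where it is until the RS recovers. That policy choice is an assumption drawn from the description, not something this sketch implements.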