Date: Wed, 29 Jun 2011 22:29:33 +0000 (UTC)
From: "Jean-Daniel Cryans (JIRA)"
To: issues@hbase.apache.org
Message-ID: <539236603.3978.1309386573168.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <702120079.3410.1307990211676.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Resolved] (HBASE-3984) CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery

     [ https://issues.apache.org/jira/browse/HBASE-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans resolved HBASE-3984.
---------------------------------------

      Resolution: Fixed
    Release Note:
In trunk: All HRegionInterface methods now throw a RegionServerStoppedException if the region server is stopped, whereas we used to check for that state in only a few methods. SingleServerBulkAssigner no longer kills the Master when it gets IOEs; instead it just logs an error and the TimeoutMonitor takes care of picking up the pieces.
In 0.90: Only a couple of checkOpen calls were added, in order to change as little code as possible while still fixing the issue.
    Hadoop Flags: [Reviewed]

Committed the 0.90 patch to the branch and the other patch to trunk, including the fix that Ted pointed to. Thanks, guys, for the reviews.

> CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-3984
>                 URL: https://issues.apache.org/jira/browse/HBASE-3984
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.3
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3984-0.90-v2.patch, HBASE-3984-0.90.patch, HBASE-3984-trunk-v2.patch, HBASE-3984-trunk.patch
>
>
> After some extensive debugging in the thread [A sudden msg of "java.io.IOException: Server not running, aborting"|http://search-hadoop.com/m/Qb0BMnrTPZ1], we figured out that the region servers weren't able to talk to the new .META. location because the old one was still alive but on its way down after an OOME.
> It translates into exceptions like "Server not running" coming from attempts to edit .META. Digging in the code, I see that CT.waitForMetaServerConnectionDefault -> waitForMeta -> getMetaServerConnection(true) calls verifyRegionLocation since we force the refresh. In this method we check whether the RS is good by calling getRegionInfo, which *does not* check whether the region server is trying to close.
> What this means is that a cluster can't recover from a .META.-serving RS failure until that RS has fully shut down, since every time an RS tries to open a region (like right after the log splitting) or split one, it fails editing .META.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
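
For illustration, the checkOpen guard mentioned in the release note can be sketched roughly as below. This is a toy model and not HBase source: ToyRegionServer, getRegionInfoUnguarded, beginShutdown, and the nested RegionServerStoppedException are hypothetical stand-ins, and the real verifyRegionLocation in CatalogTracker takes different arguments. The sketch only shows why a liveness probe that skips the guard keeps reporting a stopping server as healthy.

```java
import java.io.IOException;

public class CheckOpenSketch {

    /** Hypothetical stand-in for HBase's RegionServerStoppedException. */
    public static class RegionServerStoppedException extends IOException {
        public RegionServerStoppedException(String message) {
            super(message);
        }
    }

    /** Toy region server that models nothing but the "stopping" flag. */
    public static class ToyRegionServer {
        private volatile boolean stopping = false;

        /** Simulates the server starting to go down (for example after an OOME). */
        public void beginShutdown() {
            stopping = true;
        }

        /** The guard: refuse to serve once the server is on its way down. */
        private void checkOpen() throws RegionServerStoppedException {
            if (stopping) {
                throw new RegionServerStoppedException("Server not running, aborting");
            }
        }

        /** Behaviour described in the report: answers even while stopping. */
        public String getRegionInfoUnguarded(String regionName) {
            return "info:" + regionName;
        }

        /** Behaviour after the fix, as assumed here: guard first, then answer. */
        public String getRegionInfo(String regionName) throws RegionServerStoppedException {
            checkOpen();
            return "info:" + regionName;
        }
    }

    /** Simplified probe: a location is good only if the guarded call succeeds. */
    public static boolean verifyRegionLocation(ToyRegionServer server, String regionName) {
        try {
            server.getRegionInfo(regionName);
            return true;
        } catch (RegionServerStoppedException e) {
            // The server is stopping, so callers should look up the new location.
            return false;
        }
    }

    public static void main(String[] args) {
        ToyRegionServer rs = new ToyRegionServer();
        System.out.println("before shutdown: " + verifyRegionLocation(rs, ".META."));
        rs.beginShutdown();
        System.out.println("after shutdown: " + verifyRegionLocation(rs, ".META."));
    }
}
```

With the unguarded variant, the probe would keep succeeding after beginShutdown(), which matches the stale .META. location symptom described in the report.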