Date: Sun, 18 Dec 2011 21:00:30 +0000 (UTC)
From: "Eli Collins (Updated) (JIRA)"
To: hdfs-issues@hadoop.apache.org
Message-ID: <1671466380.24580.1324242030975.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1835823652.22954.1324152090644.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Updated] (HDFS-2702) A single failed name dir can cause the NN to exit

[
https://issues.apache.org/jira/browse/HDFS-2702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Collins updated HDFS-2702:
------------------------------

    Attachment: hdfs-2702.txt

Updated patch with a new test class that covers:
#1 The NN doesn't exit as long as it has a valid storage dir
#2 The NN exits when it no longer has a valid storage dir
#3 The list of removed storage dirs is updated (fails w/o HDFS-2703)

> A single failed name dir can cause the NN to exit
> --------------------------------------------------
>
>                 Key: HDFS-2702
>                 URL: https://issues.apache.org/jira/browse/HDFS-2702
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 1.0.0
>            Reporter: Eli Collins
>            Assignee: Eli Collins
>            Priority: Critical
>         Attachments: hdfs-2702.txt, hdfs-2702.txt
>
>
> There's a bug in FSEditLog#rollEditLog which results in the NN process exiting if a single name dir has failed. Here's the relevant code:
> {code}
> close() // So editStreams.size() is 0
> foreach edits dir {
>   try {
>     ..
>     eStream = new ... // Might get an IOE here
>     editStreams.add(eStream);
>   } catch (IOException ioe) {
>     removeEditsForStorageDir(sd); // exits if editStreams.size() <= 1
>   }
> }
> {code}
> If we get an IOException before we've added two edit streams to the list, we'll exit. E.g., if there's an error processing the 1st name dir, we'll exit even if there are 4 valid name dirs. The fix is to move the check out of removeEditsForStorageDir (nee processIOError), or to modify it so the check can be disabled in some cases, e.g. here, where we don't yet know how many streams are valid.
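
The failure mode described in the issue can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not the actual FSEditLog code and not the attached patch: the class RollEditLogSketch, the helpers openStream, rollBuggy and rollFixed, and the FatalExit exception (standing in for the NN process exit) are all hypothetical names. rollFixed shows only one of the two options mentioned in the description, namely moving the check out of the error handler so it runs once every dir has been tried.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RollEditLogSketch {

    /** Hypothetical stand-in for the NN exiting; thrown so the sketch stays testable. */
    static class FatalExit extends RuntimeException {
        FatalExit(String msg) { super(msg); }
    }

    /** Hypothetical opener: any dir whose name starts with "bad" fails with an IOException. */
    static String openStream(String dir) throws IOException {
        if (dir.startsWith("bad")) {
            throw new IOException("cannot open edits stream in " + dir);
        }
        return "stream:" + dir;
    }

    /** Buggy shape: the size check runs inside the error handler, while the list is still being built. */
    static List<String> rollBuggy(List<String> dirs) {
        List<String> editStreams = new ArrayList<>(); // empty, as after close()
        for (String dir : dirs) {
            try {
                editStreams.add(openStream(dir));
            } catch (IOException ioe) {
                // Mirrors "exits if editStreams.size() <= 1": too early to judge here.
                if (editStreams.size() <= 1) {
                    throw new FatalExit("no usable edit streams after failure in " + dir);
                }
            }
        }
        return editStreams;
    }

    /** Assumed fix: only record failures in the loop, and check once all dirs were tried. */
    static List<String> rollFixed(List<String> dirs) {
        List<String> editStreams = new ArrayList<>();
        List<String> failedDirs = new ArrayList<>();
        for (String dir : dirs) {
            try {
                editStreams.add(openStream(dir));
            } catch (IOException ioe) {
                failedDirs.add(dir); // remember the failed dir, do not exit yet
            }
        }
        if (editStreams.isEmpty()) {
            throw new FatalExit("all edits dirs failed: " + failedDirs);
        }
        return editStreams;
    }

    public static void main(String[] args) {
        // One failed dir first, followed by four valid ones, as in the description.
        List<String> dirs = Arrays.asList("bad1", "good2", "good3", "good4", "good5");
        try {
            rollBuggy(dirs);
        } catch (FatalExit e) {
            System.out.println("buggy version exits: " + e.getMessage());
        }
        System.out.println("fixed version keeps " + rollFixed(dirs).size() + " streams");
    }
}
```

In the sketch, the buggy version gives up on the first failed dir because the list is still empty at that point, while the fixed version exits only when no dir at all produced a stream.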