Date: Mon, 29 Oct 2012 21:26:13 +0000 (UTC)
From: "Todd Lipcon (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-4128) 2NN gets stuck in inconsistent state if edit log replay fails in the middle

[ https://issues.apache.org/jira/browse/HDFS-4128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486382#comment-13486382 ]

Todd Lipcon commented on HDFS-4128:
-----------------------------------

The issue is that it's starting replay at the beginning of the next full segment, instead of starting with the previous half-replayed segment and skipping forward to the correct txid.
We could either fix this, or change the 2NN so that it aborts entirely if edit log replay fails. Given that such errors are generally just going to happen again on the next attempt, it's probably better to fail hard so an admin notices, instead of retrying forever.

> 2NN gets stuck in inconsistent state if edit log replay fails in the middle
> ---------------------------------------------------------------------------
>
>                 Key: HDFS-4128
>                 URL: https://issues.apache.org/jira/browse/HDFS-4128
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 2.0.2-alpha
>            Reporter: Todd Lipcon
>
> We saw the following issue in a cluster:
> - The 2NN downloads an edit log segment:
> {code}
> 2012-10-29 12:30:57,433 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Reading /xxxxxxx/current/edits_0000000000049136809-0000000000049176162 expecting start txid #49136809
> {code}
> - It fails in the middle of replay due to an OOME:
> {code}
> 2012-10-29 12:31:21,021 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation AddOp [length=0, path=/xxxxxxxx
> java.lang.OutOfMemoryError: Java heap space
> {code}
> - Future checkpoints then fail because the prior edit log replay only got halfway through the stream:
> {code}
> 2012-10-29 12:32:21,214 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Reading /xxxxx/current/edits_0000000000049176163-0000000000049177224 expecting start txid #49144432
> 2012-10-29 12:32:21,216 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint
> java.io.IOException: There appears to be a gap in the edit log. We expected txid 49144432, but got txid 49176163.
> {code}
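
For illustration, the first option in the comment (resume from the half-replayed segment and skip forward to the correct txid) might look roughly like the sketch below. This is not the actual Hadoop code: `EditLogReplaySketch`, `Segment`, `segmentsToReplay`, and `txnsToSkip` are hypothetical names invented here, and the only inputs assumed are each segment's first and last txid plus the last txid that was successfully applied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these names come from the Hadoop code base.
final class EditLogReplaySketch {

    /** Stand-in for a finalized edit log segment covering [firstTxId, lastTxId]. */
    static final class Segment {
        final long firstTxId;
        final long lastTxId;

        Segment(long firstTxId, long lastTxId) {
            this.firstTxId = firstTxId;
            this.lastTxId = lastTxId;
        }
    }

    /**
     * Selects every segment that still contains unapplied transactions.
     * A half-replayed segment is kept, rather than jumping to the next full one.
     */
    static List<Segment> segmentsToReplay(List<Segment> segments, long lastAppliedTxId) {
        List<Segment> result = new ArrayList<>();
        for (Segment s : segments) {
            if (s.lastTxId > lastAppliedTxId) {
                result.add(s);
            }
        }
        return result;
    }

    /** Number of already-applied transactions to skip at the start of a segment. */
    static long txnsToSkip(Segment s, long lastAppliedTxId) {
        return Math.max(0L, lastAppliedTxId + 1 - s.firstTxId);
    }

    public static void main(String[] args) {
        // Segment boundaries and expected txid taken from the log excerpts in the issue.
        List<Segment> segments = Arrays.asList(
                new Segment(49136809L, 49176162L),
                new Segment(49176163L, 49177224L));
        long lastAppliedTxId = 49144432L - 1; // replay expected 49144432 next

        for (Segment s : segmentsToReplay(segments, lastAppliedTxId)) {
            System.out.println("replay " + s.firstTxId + "-" + s.lastTxId
                    + ", skipping " + txnsToSkip(s, lastAppliedTxId) + " txns");
        }
    }
}
```

With the txids from the logs, both segments are selected, and the first is entered partway through, so the next transaction read is the expected 49144432 and no gap is reported. The second option from the comment, aborting on any replay failure, would need none of this bookkeeping.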