Message-ID: <13217607.1190667530962.JavaMail.jira@brutus>
Date: Mon, 24 Sep 2007 13:58:50 -0700 (PDT)
From: "dhruba borthakur (JIRA)"
To: hadoop-dev@lucene.apache.org
Subject: [jira] Updated: (HADOOP-1076) Periodic checkpointing cannot resume if the secondary name-node fails.
In-Reply-To: <10755861.1173240564117.JavaMail.root@brutus>

     [ https://issues.apache.org/jira/browse/HADOOP-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-1076:
-------------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this.

> Periodic checkpointing cannot resume if the secondary name-node fails.
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-1076
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1076
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Konstantin Shvachko
>            Assignee: dhruba borthakur
>             Fix For: 0.15.0
>
>         Attachments: secondaryRestart4.patch
>
>
> If the secondary name-node fails during checkpointing then the primary node will have two edits files.
> "edits" - the file the current checkpoint is to be based upon.
> "edits.new" - the file where new name space edits are currently logged.
> The problem is that the primary node cannot do checkpointing while the "edits.new" file is in place.
> That is, even if the secondary name-node is restarted, periodic checkpointing is not going to be resumed.
> In fact the primary node will be throwing an exception complaining about the existing "edits.new".
> There is only one way to get rid of the edits.new file - to restart the primary name-node.
> So in a way if the secondary name-node fails then you should restart the whole cluster.
> Here is a rather simple modification to the current approach, which we discussed with Dhruba.
> When the secondary node requests rollEditLog() the primary node should roll the edit log only if
> it has not already been rolled. Otherwise the existing "edits" file will be used for checkpointing
> and the primary node will keep accumulating new edits in "edits.new".
> In order to make it work the primary node should also ignore any rollFSImage() requests when it
> has already started to perform one. Otherwise the new image can become corrupted if two secondary
> nodes request rollFSImage() at the same time.
> 2. Also, after the periodic checkpointing patch HADOOP-227 I see pieces of unused code.
> I noticed one data member, SecondaryNameNode.localName, and at least 4 methods in FSEditLog
> that are not used anywhere. We should remove them, and others like them if found.
> Supporting unused code is such a waste of time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
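
The rollEditLog()/rollFSImage() behaviour proposed in the quoted description can be sketched roughly as follows. This is a minimal in-memory model, not the code from secondaryRestart4.patch or Hadoop's FSEditLog/FSImage classes: the class name CheckpointModel, the helper methods logEdit, checkpointEdits and pendingEdits, and the boolean return values are assumptions made for illustration. Only the method names rollEditLog and rollFSImage come from the issue text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Hypothetical model of the primary name-node's edit-log state.
 * Two in-memory lists stand in for the "edits" and "edits.new" files.
 */
class CheckpointModel {
    private List<String> edits = new ArrayList<>(); // "edits": basis of the current checkpoint
    private List<String> editsNew = null;           // "edits.new": null until the log is rolled
    private final AtomicBoolean imageRollInProgress = new AtomicBoolean(false);

    /** New name space edits go to "edits.new" once the log has been rolled. */
    synchronized void logEdit(String op) {
        if (editsNew != null) {
            editsNew.add(op);
        } else {
            edits.add(op);
        }
    }

    /**
     * Rolls the edit log only if it has not been rolled already.
     * Returns false when "edits.new" exists, e.g. after a failed checkpoint;
     * the existing "edits" is then reused and no exception is thrown.
     */
    synchronized boolean rollEditLog() {
        if (editsNew != null) {
            return false;
        }
        editsNew = new ArrayList<>();
        return true;
    }

    /**
     * Promotes "edits.new" to "edits". A request that arrives while another
     * roll is in progress is ignored (returns false), as is a request made
     * when there is no "edits.new" to promote.
     */
    boolean rollFSImage() {
        if (!imageRollInProgress.compareAndSet(false, true)) {
            return false; // another secondary node is already rolling the image
        }
        try {
            synchronized (this) {
                if (editsNew == null) {
                    return false;
                }
                edits = editsNew;
                editsNew = null;
                return true;
            }
        } finally {
            imageRollInProgress.set(false);
        }
    }

    synchronized int checkpointEdits() {
        return edits.size();
    }

    synchronized int pendingEdits() {
        return editsNew == null ? 0 : editsNew.size();
    }
}
```

In this model a restarted secondary node can call rollEditLog() again after a crash: the second call is a no-op rather than an error, so a later rollFSImage() can still complete and periodic checkpointing resumes without restarting the primary.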