Date: Thu, 5 Apr 2012 22:20:25 +0000 (UTC)
From: "Colin Patrick McCabe (Commented) (JIRA)"
To: hdfs-issues@hadoop.apache.org
Message-ID: <1008290088.20139.1333664425549.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <837427371.11003.1330025088720.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (HDFS-3004) Implement Recovery Mode

[ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247767#comment-13247767 ]

Colin Patrick McCabe commented on HDFS-3004:
--------------------------------------------

Todd:
> this isn't compiling anymore...

Sigh. Will rebase on trunk... again.

Nicholas:
> Since the recover mode may cause data lost, we should prompt and warn the user in the very beginning.

We prompt the user before doing anything destructive, unless the -autoChooseDefault option is enabled.

> What happen if "-autoChooseDefault" is run with other options or standalone?

Recovery mode has no options other than -autoChooseDefault. Check the usage or run -h for more information.

> Why remove JournalStream?

It's dead code that does nothing.

> Please change RequestStop to RequestStopException. It is better to add an error message to it. Also, please add javadoc to describe what does it mean.

Ok.

> Why askOperator(..) belongs to FSEditLogLoader but not RecoveryContext?

Because it relates to FSEditLogLoader, not to RecoveryContext.

> Could you rename RecoveryContext to something related to image/edit, say ImageEditRecovery.

I suppose MetaRecoveryContext could work. This would avoid confusion with "name node lease recovery" or "datanode recovery." Similarly, we could add "meta" before some other recovery-related things.
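
To make the prompting behavior discussed above concrete, here is a rough sketch of how an operator prompt with an "always take the first option" switch could look. This is not the code from the patch: the class name RecoveryPromptSketch, the ask(...) method, the choice strings, and the constructor shape are all made up for illustration. Only the ideas come from this thread and the issue description: prompt before acting, skip prompts when -autoChooseDefault is set, let the operator type 'a' to take the first option from then on, and use a RequestStopException that carries an error message.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch only; not the classes from the HDFS-3004 patch. */
class RecoveryPromptSketch {
    /** Signals that recovery should stop; always carries an error message. */
    static class RequestStopException extends IOException {
        RequestStopException(String message) {
            super(message);
        }
    }

    private boolean autoChooseDefault;
    private final BufferedReader in;

    RecoveryPromptSketch(boolean autoChooseDefault, BufferedReader in) {
        this.autoChooseDefault = autoChooseDefault;
        this.in = in;
    }

    /**
     * Asks the operator to pick one of the choices. The first choice is the
     * default: it is returned without prompting when auto-choose is on, and
     * answering "a" turns auto-choose on for all later prompts.
     */
    String ask(String prompt, String firstChoice, String... otherChoices)
            throws IOException {
        if (autoChooseDefault) {
            return firstChoice;
        }
        List<String> choices = new ArrayList<>();
        choices.add(firstChoice);
        choices.addAll(Arrays.asList(otherChoices));
        while (true) {
            System.err.println(prompt + " " + choices
                + " (a = always take the first option)");
            String line = in.readLine();
            if (line == null) {
                // No more input: stop rather than guess on a destructive step.
                throw new RequestStopException(
                    "input closed while waiting for an answer to: " + prompt);
            }
            String answer = line.trim().toLowerCase(Locale.ROOT);
            if (answer.equals("a")) {
                autoChooseDefault = true;
                return firstChoice;
            }
            if (choices.contains(answer)) {
                return answer;
            }
            System.err.println("Unrecognized answer: " + answer);
        }
    }

    public static void main(String[] args) throws IOException {
        // Scripted answers stand in for an operator typing at the console.
        BufferedReader scripted =
            new BufferedReader(new StringReader("bogus\nskip\na\n"));
        RecoveryPromptSketch ctx = new RecoveryPromptSketch(false, scripted);
        System.out.println(ctx.ask("Problem 1.", "continue", "skip", "stop"));
        System.out.println(ctx.ask("Problem 2.", "continue", "skip", "stop"));
        System.out.println(ctx.ask("Problem 3.", "continue", "skip", "stop"));
    }
}
```

In the scripted run, the first prompt rejects "bogus" and then accepts "skip". The second prompt receives "a", so it returns the first option and switches auto-choose on. The third prompt returns the first option without reading any input. Throwing on end of input instead of falling back to the default is a choice made for this sketch, on the assumption that a destructive step should not proceed without an explicit answer.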

> Implement Recovery Mode
> -----------------------
>
> Key: HDFS-3004
> URL: https://issues.apache.org/jira/browse/HDFS-3004
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: tools
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, HDFS-3004.034.patch, HDFS-3004.035.patch, HDFS-3004.036.patch, HDFS-3004.037.patch, HDFS-3004.038.patch, HDFS-3004__namenode_recovery_tool.txt
>
>
> When the NameNode metadata is corrupt for some reason, we want to be able to fix it. Obviously, we would prefer never to get into this situation. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime.
> Recovery mode is initiated by the system administrator. When the NameNode starts up in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits log, and then write out a new image. Then it will shut down.
> Unlike the normal startup process, the recovery mode startup process will be interactive. When the NameNode finds something that is inconsistent, it will ask the operator what it should do. The operator can also choose to take the first option for all prompts by starting up with the '-f' flag, or by typing 'a' at one of the prompts.
> I have reused as much code as possible from the NameNode in this tool.
> Hopefully, the effort that was spent developing this will also make the NameNode editLog and image processing even more robust than it already is.