Date: Sat, 3 Mar 2012 01:11:57 +0000 (UTC)
From: "Colin Patrick McCabe (Updated) (JIRA)"
To: hdfs-issues@hadoop.apache.org
Reply-To: hdfs-issues@hadoop.apache.org
Message-ID: <198012217.16248.1330737117562.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <837427371.11003.1330025088720.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Updated] (HDFS-3004) Create Offline NameNode recovery tool

     [ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Colin Patrick McCabe updated HDFS-3004:
---------------------------------------

    Attachment:     (was: HDFS-3004.001.patch)

> Create Offline NameNode recovery tool
> -------------------------------------
>
>                 Key: HDFS-3004
>                 URL: https://issues.apache.org/jira/browse/HDFS-3004
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: tools
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-3004.patch, HDFS-3004__namenode_recovery_tool.txt
>
>
> We've been talking about creating a tool which can process NameNode edit logs and image files offline.
> This tool would be similar to a fsck for a conventional filesystem. It would detect inconsistencies and malformed data. In cases where it was possible, and the operator asked for it, it would try to correct the inconsistency.
> It's probably better to call this "nameNodeRecovery" or similar, rather than "fsck," since we already have a separate and unrelated mechanism which we refer to as fsck.
> The use case here is that the NameNode data is corrupt for some reason, and we want to fix it. Obviously, we would prefer never to get into this situation. In a perfect world, we never would. However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations. In the past we have had to correct it manually, which is time-consuming and which can result in downtime.
> I would like to reuse as much code as possible from the NameNode in this tool. Hopefully, the effort that is spent developing this will also make the NameNode editLog and image processing even more robust than it already is.
> Another approach that we have discussed is NOT having an offline tool, but just having a switch supplied to the NameNode, like "--auto-fix" or "--force-fix".
In that case, the NameNode would attempt to "guess" when data was missing or incomplete in the EditLog or Image, rather than aborting as it does now. Like the proposed fsck tool, this switch could be used to get users back on their feet quickly after a problem developed. I am not in favor of this approach, because there is a danger that users could supply this flag in cases where it is not appropriate. This risk does not exist for an offline fsck tool, since it would have to be run explicitly. However, I wanted to mention this proposal here for completeness.