hadoop-hdfs-dev mailing list archives

From "Colin Patrick McCabe (Created) (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-3004) Create Offline NameNode recovery tool
Date Thu, 23 Feb 2012 19:24:48 GMT
Create Offline NameNode recovery tool
-------------------------------------

                 Key: HDFS-3004
                 URL: https://issues.apache.org/jira/browse/HDFS-3004
             Project: Hadoop HDFS
          Issue Type: New Feature
          Components: tools
            Reporter: Colin Patrick McCabe
            Assignee: Colin Patrick McCabe


We've been talking about creating a tool which can process NameNode edit logs and image files
offline.

This tool would be similar to a fsck for a conventional filesystem.  It would detect inconsistencies
and malformed data.  In cases where it was possible, and the operator asked for it, it would
try to correct the inconsistency.
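
As a rough illustration of the detect-then-optionally-correct behavior described above, here is a minimal sketch. Everything in it is an assumption made for illustration: the `Record` type, the `scan` function, and the rule "truncate at the first bad record" are hypothetical and do not reflect the real NameNode edit log format or any existing HDFS code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Record:
    """Hypothetical stand-in for one edit log entry (not the real HDFS format)."""
    txid: int               # assumed to increase by exactly 1 per record
    payload: Optional[str]  # None models an unreadable or malformed entry


def scan(records: List[Record], fix: bool = False) -> Tuple[List[str], List[Record]]:
    """Check a toy log for malformed entries and transaction-id gaps.

    Report-only mode (fix=False) lists every problem and returns the
    records unchanged. Fix mode (fix=True) stops at the first problem and
    returns only the valid prefix, which is one conservative repair policy
    among several possible ones.
    """
    problems: List[str] = []
    kept: List[Record] = []
    expected = records[0].txid if records else 0
    for index, rec in enumerate(records):
        if rec.payload is None:
            problems.append(f"record {index}: malformed payload (txid {rec.txid})")
        elif rec.txid != expected:
            problems.append(
                f"record {index}: expected txid {expected}, found {rec.txid}"
            )
        else:
            kept.append(rec)
            expected += 1
            continue
        if fix:
            # Operator asked for a repair: drop this record and everything after it.
            return problems, kept
        # Report-only: resynchronize on the observed txid and keep scanning.
        expected = rec.txid + 1
    return problems, (kept if fix else list(records))


if __name__ == "__main__":
    log = [
        Record(1, "mkdir /a"),
        Record(2, "create /a/f"),
        Record(4, "delete /a/f"),  # gap: txid 3 is missing
        Record(5, None),           # malformed entry
    ]
    for line in scan(log)[0]:
        print(line)
```

The point of the sketch is only that nothing is modified unless the operator explicitly requests it, which mirrors the "in cases where it was possible, and the operator asked for it" condition.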

It's probably better to call this "nameNodeRecovery" or similar, rather than "fsck," since
we already have a separate and unrelated mechanism which we refer to as fsck.

The use case here is that the NameNode data is corrupt for some reason, and we want to fix
it.  Obviously, we would prefer never to end up in this situation, and in a perfect world we never would.
However, bad data on disk can happen from time to time because of hardware errors or misconfigurations.
In the past we have had to correct it manually, which is time-consuming and can result
in downtime.

I would like to reuse as much code as possible from the NameNode in this tool.  Hopefully,
the effort that is spent developing this will also make the NameNode editLog and image processing
even more robust than it already is.

Another approach that we have discussed is NOT having an offline tool, but just having a switch
supplied to the NameNode, like "--auto-fix" or "--force-fix".  In that case, the NameNode
would attempt to "guess" when data was missing or incomplete in the EditLog or Image, rather
than aborting as it does now.  Like the proposed offline tool, this switch could be used to get
users back on their feet quickly after a problem developed.  I am not in favor of this approach,
because there is a danger that users could supply this flag in cases where it is not appropriate.
This risk does not exist for an offline tool, since it would have to be run explicitly.
However, I wanted to mention this proposal here for completeness.


       
