hadoop-hdfs-issues mailing list archives

From "Eli Collins (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3004) Implement Recovery Mode
Date Tue, 13 Mar 2012 04:17:55 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228186#comment-13228186 ]

Eli Collins commented on HDFS-3004:

bq. The first choice isn't always skip-- sometimes it's "truncate."

Why would a user choose "always choose 1st"? The user doesn't know whether future errors will
be skippable, so when they select "always choose first" at a skippable prompt, they don't
know that they may be signing up for a future truncate. It seems we need to make the order
of the choices consistent if we're going to give people a "Yes to all" option.
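
One way to picture the concern is a sticky answer that is remembered by action rather than by menu position, so "always skip" can never turn into a truncate. The sketch below is only an illustration: RecoveryPromptSketch, Action and StickyChoice are hypothetical names, not code from the HDFS-3004 patch, and binding by action is an alternative to the consistent ordering suggested above, not a description of what the patch does.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch; none of these names come from the HDFS-3004 patch. */
public class RecoveryPromptSketch {
  enum Action { SKIP, TRUNCATE, QUIT }

  /** Remembers an "always" answer by action, not by position in the menu. */
  static class StickyChoice {
    private Action always; // null until the operator picks "always <action>"

    void remember(Action a) {
      always = a;
    }

    /**
     * Returns the remembered action only if this prompt actually offers it.
     * Returns null otherwise, so the caller has to ask the operator again.
     */
    Action resolve(Set<Action> offered) {
      return (always != null && offered.contains(always)) ? always : null;
    }
  }

  public static void main(String[] args) {
    StickyChoice choice = new StickyChoice();
    choice.remember(Action.SKIP); // operator answered "always skip"

    // Skippable error: the sticky answer applies.
    System.out.println(choice.resolve(EnumSet.of(Action.SKIP, Action.TRUNCATE, Action.QUIT)));
    // Non-skippable error: SKIP is not offered, so nothing is chosen silently.
    System.out.println(choice.resolve(EnumSet.of(Action.TRUNCATE, Action.QUIT)));
  }
}
```

With this shape, a prompt that only offers truncate or quit falls back to asking the operator instead of silently applying whatever happens to be listed first.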

- Per above, what is the "TODO: attempt to resynchronize stream here" for?
- Should the catch of Throwable go back to catching IOException, as it used to? We're not
trying to catch new types of exceptions in the non-recovery case, right?
- Do we need to sanity check dfs.namenode.num.checkpoints.retained in recovery mode? I.e., since
we do roll the log, is there any way that we could load an image/log, truncate it in recovery
mode, and then not retain the old log?
- TestRecoverTruncatedEditLog still doesn't check that we actually truncated the log. E.g., even
if we didn't truncate the log, the test would still pass, because the directory would still
be there.
- What testing have you done? It would be good to try this on a tarball build with various
corrupt and non-corrupt images/logs.
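
To illustrate the point about TestRecoverTruncatedEditLog, a check that compares the log's size before and after recovery fails when no truncation happened, whereas a check that the directory exists does not. The sketch below is a stand-alone illustration using a temporary file. TruncationCheckSketch and its helpers are hypothetical, and truncateTo merely stands in for whatever recovery mode does to the edit log; it is not the actual test or NameNode code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch; not the real TestRecoverTruncatedEditLog. */
public class TruncationCheckSketch {

  /** Creates a temporary stand-in "edit log" filled with the given number of bytes. */
  static Path newLogOfSize(int bytes) {
    try {
      Path log = Files.createTempFile("edits-sketch", ".bin");
      Files.write(log, new byte[bytes]);
      return log;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Stand-in for recovery: cuts the file down to newLength bytes. */
  static void truncateTo(Path log, long newLength) {
    try (FileChannel ch = FileChannel.open(log, StandardOpenOption.WRITE)) {
      ch.truncate(newLength);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** True only if the file shrank from sizeBefore to exactly expectedLength. */
  static boolean wasTruncated(Path log, long sizeBefore, long expectedLength) {
    try {
      long sizeAfter = Files.size(log);
      return sizeAfter < sizeBefore && sizeAfter == expectedLength;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) throws IOException {
    Path log = newLogOfSize(128);
    long before = Files.size(log);

    // Without truncation the size check fails, even though the file still exists.
    System.out.println(wasTruncated(log, before, 64));

    truncateTo(log, 64);
    System.out.println(wasTruncated(log, before, 64));

    Files.delete(log);
  }
}
```

A real test would presumably assert on the recovered edit log itself, for example its length or the last transaction it contains, rather than on a temporary file.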

> Implement Recovery Mode
> -----------------------
>                 Key: HDFS-3004
>                 URL: https://issues.apache.org/jira/browse/HDFS-3004
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: tools
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-3004.010.patch, HDFS-3004__namenode_recovery_tool.txt
> When the NameNode metadata is corrupt for some reason, we want to be able to fix it.
> Obviously, we would prefer never to get in this case.  In a perfect world, we never would.
> However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations.
> In the past we have had to correct it manually, which is time-consuming and which can result
> in downtime.
> Recovery mode is initialized by the system administrator.  When the NameNode starts up
> in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits
> log, and then write out a new image.  Then it will shut down.
> Unlike in the normal startup process, the recovery mode startup process will be interactive.
> When the NameNode finds something that is inconsistent, it will prompt the operator as to
> what it should do.   The operator can also choose to take the first option for all prompts
> by starting up with the '-f' flag, or typing 'a' at one of the prompts.
> I have reused as much code as possible from the NameNode in this tool.  Hopefully, the
> effort that was spent developing this will also make the NameNode editLog and image processing
> even more robust than it already is.
