hadoop-hdfs-issues mailing list archives

From "Eli Collins (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3004) Implement Recovery Mode
Date Fri, 09 Mar 2012 19:48:56 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226378#comment-13226378 ]

Eli Collins commented on HDFS-3004:

Your comments above make sense, thanks for the explanation.

Comments on latest patch:
- HDFS-2709 (hash 110b6d0) introduced EditLogInputException and used to have places where
it was caught explicitly. Those places now just catch IOE, and since we no longer throw it
either, you can remove the class entirely
- In logTruncateMessage we should log something like "stopping edit log load at position X"
instead of saying we're truncating it because we're not actually truncating the log (from
the user's perspective)
- Isn't "always select the first choice" effectively "always skip"? Better to call it that
as users might think it means use the previously selected option for all future choices (eg
if I chose "skip" then chose "try to fix" then "always choose 1st" I might not have meant
to "always skip").
- The conditional on "answer" is probably more readable as a switch; it wasn't clear that the
else clause was always "a" and that this is why we call recovery.setAlwaysChooseFirst()
- What is the "TODO: attempt to resynchronize stream here" for?
- Should use "s".equals(answer) instead of answer == "s" etc, since the == check would break
if RecoveryContext ever stops returning the exact object it was passed
- RC#ask should log as info instead of error, both for the prompt and when automatically choosing
- RC#ask javadoc needs to be updated to match the method. Also, "his choice" -> "their
choice" =P
- RecoveryContext could use a high-level javadoc with a sentence or two since the name is
pretty generic and the use is very specific
- Can s/LOG.error/LOG.fatal/ in NN.java for recovery failed case
- NN#printUsage has two IMPORT lines
- ++i still used in a couple files
- brackets on their own line still need fixing eg "} else if {"
- Why does TestRecoverTruncatedEditLog make the same dir 21 times? Maybe you meant to append
"i" to the path? The test should corrupt an operation that mutates the namespace (vs the last
op, which I believe is an op to finalize the log segment) so you can test that the edit is
not present when you reload (eg corrupt the edit to mkdir /foo, then assert /foo does not exist
in the namespace)
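
To make the "answer" points above concrete, here is a minimal standalone sketch. It is not the patch's code: RecoveryPromptSketch, Choice and parse are hypothetical names, "f" as the key for "try to fix" is an assumption (only "s" and "a" appear in this thread), and a switch on String assumes Java 7 or later (on Java 6 one would switch on a char or an enum instead).

```java
import java.util.Locale;

// Hypothetical stand-in for the prompt handling discussed above; not the patch's code.
public class RecoveryPromptSketch {
    // ALWAYS_SKIP is named for what it does, rather than "always choose first".
    enum Choice { SKIP, FIX, ALWAYS_SKIP }

    // Maps the operator's answer to a Choice. "f" is an assumed key.
    static Choice parse(String answer) {
        if (answer == null) {
            throw new IllegalArgumentException("no answer given");
        }
        switch (answer.trim().toLowerCase(Locale.ROOT)) {
            case "s": return Choice.SKIP;
            case "f": return Choice.FIX;
            case "a": return Choice.ALWAYS_SKIP;
            default:
                // No silent fall-through to "a": unknown input is an error.
                throw new IllegalArgumentException("unrecognized answer: " + answer);
        }
    }

    public static void main(String[] args) {
        // new String(...) forces a distinct object, which is what a future
        // RecoveryContext might hand back instead of the original literal.
        String answer = new String("s");
        System.out.println(answer == "s");       // false: compares references
        System.out.println("s".equals(answer));  // true: compares contents
        System.out.println(parse(answer));       // SKIP
    }
}
```

With an explicit case for "a", the reader no longer has to infer that the trailing else is the "always" branch, and an unexpected answer fails loudly instead of being treated as "a".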

> Implement Recovery Mode
> -----------------------
>                 Key: HDFS-3004
>                 URL: https://issues.apache.org/jira/browse/HDFS-3004
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: tools
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-3004.008.patch, HDFS-3004__namenode_recovery_tool.txt
> When the NameNode metadata is corrupt for some reason, we want to be able to fix it.
> Obviously, we would prefer never to get in this case.  In a perfect world, we never would.
> However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations.
> In the past we have had to correct it manually, which is time-consuming and which can result
> in downtime.
> Recovery mode is initialized by the system administrator.  When the NameNode starts up
> in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits
> log, and then write out a new image.  Then it will shut down.
> Unlike in the normal startup process, the recovery mode startup process will be interactive.
> When the NameNode finds something that is inconsistent, it will prompt the operator as to
> what it should do.   The operator can also choose to take the first option for all prompts
> by starting up with the '-f' flag, or typing 'a' at one of the prompts.
> I have reused as much code as possible from the NameNode in this tool.  Hopefully, the
> effort that was spent developing this will also make the NameNode editLog and image processing
> even more robust than it already is.
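
The replay behaviour in the quoted description, and the kind of assertion a test for it could make, can be pictured with a toy model. Everything here is hypothetical and deliberately tiny: ToyRecoveryReplay, MkdirOp and replay do not exist in HDFS, a boolean flag stands in for real on-disk corruption, and "skip" is assumed to be the first option that '-f' or 'a' would select.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: none of these types exist in HDFS.
public class ToyRecoveryReplay {
    // One edit that creates a directory; 'corrupt' simulates an unreadable op.
    static final class MkdirOp {
        final String path;
        final boolean corrupt;
        MkdirOp(String path, boolean corrupt) {
            this.path = path;
            this.corrupt = corrupt;
        }
    }

    // Replays ops into a namespace. A corrupt op is skipped when alwaysSkip
    // is set; otherwise loading stops at that position.
    static Set<String> replay(List<MkdirOp> log, boolean alwaysSkip) {
        Set<String> namespace = new TreeSet<>();
        for (int i = 0; i < log.size(); i++) {
            MkdirOp op = log.get(i);
            if (op.corrupt) {
                if (alwaysSkip) {
                    continue;
                }
                System.out.println("stopping edit log load at position " + i);
                break;
            }
            namespace.add(op.path);
        }
        return namespace;
    }

    public static void main(String[] args) {
        // The corrupted op mutates the namespace, so its absence is observable.
        List<MkdirOp> log = Arrays.asList(
            new MkdirOp("/a", false),
            new MkdirOp("/foo", true),
            new MkdirOp("/b", false));
        System.out.println(replay(log, true));   // [/a, /b]
        System.out.println(replay(log, false));  // [/a]
    }
}
```

In both modes /foo is absent from the resulting namespace, which is the property a recovery test would want to assert after corrupting the mkdir /foo edit.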
