hadoop-hdfs-issues mailing list archives

From "Eli Collins (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3004) Implement Recovery Mode
Date Wed, 07 Mar 2012 20:50:57 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13224698#comment-13224698 ]

Eli Collins commented on HDFS-3004:
-----------------------------------

Overall approach looks good.

- Wrt edit logs in the namenode directory that "seem" to have a higher txid than the current
txid, isn't the idea that we have an option to actually truncate the last edit from the log?
Ie in this patch you're asking if the user would like to truncate but not actually truncating.

- Is the move of the re-check of maxSeenTxid cleanup or actually necessary now? I agree the
re-check doesn't look necessary though now we bail before adding found images if we can't
find the maxSeenTxId in the SD images, not sure that's OK.
- logTruncateMessage should probably be WARN instead of ERROR since we're doing it intentionally
(ie this code path isn't an error case), but we want it to have a high log level so we always
see it.
- In the arg checking loop can just test for one additional argument rather than looping since
we only support 1 argument
- Looks like loadEditRecords used to throw EditLogInputException in cases it now throws IOE.
Also, let's pull the recovery code out to a separate method vs implementing inline in the
catch block. It may even make sense to have a separate loadEditRecordsWithRecovery method
- Needs some more test cases, eg w/ and w/o yes to all, and that if you restart the cluster
after the recovery the fs state matches the intended state (ie if the last edit created a
file check that file is not present, but the rest of the state is in order)
- Easier if RecoveryContext#ask used var args?
- New files need the apache license header
- Testing? Aside from running the tests it would be good to try from a tarball install and start
the NN with recovery, and check the various options
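
To make a few of the points above concrete, here is a minimal sketch, not the patch's code. It assumes a toy model in which each edit is just its txid; RecoveryContextSketch, EditLoaderSketch, truncateAtFirstBadRecord and the "truncate"/"quit" choices are hypothetical names made up for illustration. It shows a var args ask(), recovery pulled out of the catch block into loadEditRecordsWithRecovery, and a truncate that actually drops the bad tail rather than only asking about it.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for the patch's RecoveryContext; not the real class. */
class RecoveryContextSketch {
  private boolean recoverYesToAll;
  // Scripted operator answers, used here instead of reading from the console.
  private final Iterator<String> answers;

  RecoveryContextSketch(boolean recoverYesToAll, String... scriptedAnswers) {
    this.recoverYesToAll = recoverYesToAll;
    this.answers = Arrays.asList(scriptedAnswers).iterator();
  }

  /** Var args form: firstChoice is what yes-to-all mode picks automatically. */
  String ask(String prompt, String firstChoice, String... otherChoices)
      throws IOException {
    if (recoverYesToAll) {
      return firstChoice;
    }
    List<String> others = Arrays.asList(otherChoices);
    while (answers.hasNext()) {
      String answer = answers.next();
      if (answer.equals("a")) {
        // 'a' switches to taking the first option for this and later prompts.
        recoverYesToAll = true;
        return firstChoice;
      } else if (answer.equals(firstChoice) || others.contains(answer)) {
        return answer;
      }
      // Unrecognized answer: ask again.
    }
    throw new IOException("No answer for prompt: " + prompt);
  }
}

/** Toy model of edit loading; each edit is represented only by its txid. */
class EditLoaderSketch {
  /** Strict load: returns the last applied txid, or throws at the first gap. */
  static long loadEditRecords(List<Long> txIds, long lastAppliedTxId)
      throws IOException {
    long expectedTxId = lastAppliedTxId + 1;
    for (long txId : txIds) {
      if (txId != expectedTxId) {
        throw new IOException(
            "Expected txid " + expectedTxId + " but found " + txId);
      }
      expectedTxId++;
    }
    return expectedTxId - 1;
  }

  /** Recovery lives here rather than inline in a catch block of the loader. */
  static long loadEditRecordsWithRecovery(List<Long> txIds,
      long lastAppliedTxId, RecoveryContextSketch ctx) throws IOException {
    try {
      return loadEditRecords(txIds, lastAppliedTxId);
    } catch (IOException e) {
      String answer = ctx.ask("Failed to load edits: " + e.getMessage(),
          "truncate", "quit");
      if (!answer.equals("truncate")) {
        throw e;
      }
      return truncateAtFirstBadRecord(txIds, lastAppliedTxId);
    }
  }

  /** Actually drops the bad tail instead of only asking about it. */
  private static long truncateAtFirstBadRecord(List<Long> txIds,
      long lastAppliedTxId) {
    int valid = 0;
    while (valid < txIds.size()
        && txIds.get(valid) == lastAppliedTxId + valid + 1) {
      valid++;
    }
    txIds.subList(valid, txIds.size()).clear();
    return lastAppliedTxId + valid;
  }
}
```

In this sketch the list passed in must be mutable (eg an ArrayList). With txids 1, 2, 3, 7 and a last applied txid of 0, a yes-to-all context truncates the list to 1, 2, 3 and returns 3, while an operator answering "quit" gets the original IOException back. Scripting the answers like this is also one way the "w/ and w/o yes to all" test cases could be driven.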

Style nits:
- I'd rename "yesToAll" to something like "recoverYesToAll" so it's clear that it's recovery related
- Method declarations should have an empty line between them
- would rename EditLogInputStream var "l" to "editIn" to be consistent with the rest of the file.
And rename long "e" to something more descriptive like "txId"
- Both brackets go on the same line in else and catch clauses (eg "} else {", eg "} catch
(..) {")
- "can't understand" and "e.getMessage()" lines need indentation
- use postfix increment to be consistent (eg txId++ vs ++txId) when it doesn't matter
- the opening bracket for a method goes on the same line as the throws clause (eg "throws
IOE {")
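
For reference, a small made-up snippet that follows the nits above. StyleNitsSketch and its methods are hypothetical and unrelated to the patch; only the layout matters: an empty line between methods, "} else {" and "} catch (..) {" on one line, the opening bracket on the same line as the throws clause, and postfix increment.

```java
import java.io.IOException;

/** Hypothetical class used only to illustrate the layout conventions. */
class StyleNitsSketch {
  // Opening bracket sits on the same line as the throws clause.
  static long countValid(long[] txIds) throws IOException {
    long count = 0;
    for (long txId : txIds) {
      if (txId < 0) {
        throw new IOException("negative txid " + txId);
      } else {
        count++; // postfix increment where the choice doesn't matter
      }
    }
    return count;
  }

  // One empty line separates this method from the previous one.
  static long countValidOrZero(long[] txIds) {
    try {
      return countValid(txIds);
    } catch (IOException e) {
      return 0;
    }
  }
}
```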
                
> Implement Recovery Mode
> -----------------------
>
>                 Key: HDFS-3004
>                 URL: https://issues.apache.org/jira/browse/HDFS-3004
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: tools
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-3004.006.patch, HDFS-3004__namenode_recovery_tool.txt
>
>
> When the NameNode metadata is corrupt for some reason, we want to be able to fix it.
> Obviously, we would prefer never to get in this case.  In a perfect world, we never would.
> However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations.
> In the past we have had to correct it manually, which is time-consuming and which can result
> in downtime.
>
> Recovery mode is initialized by the system administrator.  When the NameNode starts up
> in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits
> log, and then write out a new image.  Then it will shut down.
>
> Unlike in the normal startup process, the recovery mode startup process will be interactive.
> When the NameNode finds something that is inconsistent, it will prompt the operator as to
> what it should do.  The operator can also choose to take the first option for all prompts
> by starting up with the '-f' flag, or typing 'a' at one of the prompts.
>
> I have reused as much code as possible from the NameNode in this tool.  Hopefully, the
> effort that was spent developing this will also make the NameNode editLog and image processing
> even more robust than it already is.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
