hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Todd Lipcon (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3004) Implement Recovery Mode
Date Tue, 20 Mar 2012 00:43:39 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233065#comment-13233065
] 

Todd Lipcon commented on HDFS-3004:
-----------------------------------

- The docs reference a '-f' flag, but the code seems to be using {{-chooseFirst}}.
- Regarding above, I think {{alwaysChooseFirst}} in the code is actually hard to understand.
I'd prefer {{autoChooseDefault}} - the fact that the default is listed first is sort of ancillary.
Similarly, rename {{firstChoice}} to {{defaultChoice}} in the {{ask}} function.
- I think "q/quit" should probably be an option for all {{ask()}} calls.
- Rename {{StartupOption.getRecoveryContext}} to {{StartupOption.createRecoveryContext}} since
it creates a new one rather than returning a singleton
- There are a few unrelated changes in FSEditLogLoader that change formatting of code that
wasn't touched in the patch
- You can't remove EditLogInputException. It's important for EditLogTailer to be able to distinguish
a transient issue with reading input from the shared directory from an actual invalid operation
-- we want to fail fast in the case of the invalid operation, whereas we want to keep retrying
in the case of a transient IO error. See HDFS-2709 for details. The fact that you had to change
TestFailureToReadEdits is a red flag here.
- Does RecoveryContext need to be a public class? If so, you need to add an InterfaceAudience
annotation to mark it as Private audience
- Same with OpInstanceCache
- It seems strange to be passing OpInstanceCache to all of the {{getInstance()}} methods.
Instead, can we kill off all the {{getInstance}} methods, and add something like the following
to OpInstanceCache:

{code}
  @SuppressWarnings("unchecked")
  public <T extends FSEditLogOp> T getInstance(FSEditLogOpCodes code, Class<T>
clazz) {
    return (T)inst.get(code);
  }
{code}

this could happen in a separate follow-up JIRA though.
- Move {{logTruncateMessage}} down lower in the file near the other logging stuff, since it's
just a utility method. Also, no need to use StringBuilder here since it's not performance-critical.
Easier to read simple string concatenation.
- In FileJournalManager, please retain the log message when reading from the middle of a stream.


- I think the {{skipUntil}} API should actually throw an exception if it could not skip to
the requested txid -- we always expect to be able to do that, with the current usage. Then,
{{putOp()}} could be made private, which would keep the EditLogInputStream API cleaner (i.e
not leak the fact that there's a one-element buffer hiding inside). I'm actually not clear
on why this change is part of this patch at all -- how is this change necessary in recovery
mode?
- Rather than adding the new {{skipBrokenEdits}} to {{nextOp()}} and {{readOp()}} I think
it's better to add a new API, like {{void skipUntilNextValidEdit()}} or {{void resyncToValidEdit()}}.
Then the next call to {{readOp()}} would return that edit. I think this is easier to understand,
and parallels the SequenceFile.Reader API we have elsewhere in Hadoop.

- It seems like, in the case that an error is a mismatched transaction ID, we want one of
the recovery choices to be "apply the edit anyway" -- eg imagine that for some reason our
log just skips from txid 1 to txid 3. When we hit txid 3, we'll see an error that we expected
a different one. But we'd like to continue applying from txid 3, ignoring the gap, right?

----
{code}
-            if (txId != expectedTxId) {
-              throw new IOException("Expected transaction ID " +
-                  expectedTxId + " but got " + txId);
+            if (!skipBrokenEdits) {
+              if (op.txid != expectedTxId) {
{code}
Rather than nested ifs, just {{&&}} the two conditions

----
Formatting nit:
{code}
+  /** Display a prompt to the user and get his or her choice.
+   *  
{code}
The first line of text in the javadoc should be on the line after the {{/**}}

                
> Implement Recovery Mode
> -----------------------
>
>                 Key: HDFS-3004
>                 URL: https://issues.apache.org/jira/browse/HDFS-3004
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: tools
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, HDFS-3004.012.patch, HDFS-3004.013.patch,
HDFS-3004.015.patch, HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, HDFS-3004.019.patch,
HDFS-3004__namenode_recovery_tool.txt
>
>
> When the NameNode metadata is corrupt for some reason, we want to be able to fix it.
 Obviously, we would prefer never to get in this case.  In a perfect world, we never would.
 However, bad data on disk can happen from time to time, because of hardware errors or misconfigurations.
 In the past we have had to correct it manually, which is time-consuming and which can result
in downtime.
> Recovery mode is initialized by the system administrator.  When the NameNode starts up
in Recovery Mode, it will try to load the FSImage file, apply all the edits from the edits
log, and then write out a new image.  Then it will shut down.
> Unlike in the normal startup process, the recovery mode startup process will be interactive.
 When the NameNode finds something that is inconsistent, it will prompt the operator as to
what it should do.   The operator can also choose to take the first option for all prompts
by starting up with the '-f' flag, or typing 'a' at one of the prompts.
> I have reused as much code as possible from the NameNode in this tool.  Hopefully, the
effort that was spent developing this will also make the NameNode editLog and image processing
even more robust than it already is.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message