hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3540) Further improvement on recovery mode and edit log toleration in branch-1
Date Wed, 05 Sep 2012 18:24:08 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13448966#comment-13448966 ]

Colin Patrick McCabe commented on HDFS-3540:
--------------------------------------------

Let me go into a little more detail here.

When we were originally talking about Recovery Mode, one big concern we had was that system
administrators would overuse Recovery Mode to fix issues that might be better addressed in
a different way.  Of course, it's impossible to prevent all misuse; human beings are not
perfect, and any tool can be misused.  That's why we made Recovery Mode a startup
option rather than a configuration setting.  It would be too easy for people to set the
configuration and then leave it set even after the problem was gone.  That's also why a NameNode
in Recovery Mode exits as soon as it has loaded the edit log and written a new FSImage.  This was all
discussed in HDFS-3004.

Obviously, edit log toleration goes against those assumptions, and in a way that, frankly, I
think is very dangerous.

Recovery Mode is generally an extensible concept.  Since it has nothing to do with the physical
structure of the edit log on-disk, it can be extended to handle arbitrary types of corruption.
 For example, what if you encounter an edit that relies on a directory that doesn't exist
(because of corruption earlier in the log)?  This is something that recovery mode could conceivably
handle by displaying a prompt and asking "would you like to create the parent directory for
the directory this edit references?"
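
To make the idea concrete, here is a minimal sketch of prompt-driven replay.  It is written in
Python for brevity, not in Hadoop's Java, and it is not the actual Recovery Mode code.  The names
`Edit`, `replay`, and `ask`, and the op name "MKDIR", are all hypothetical and chosen only for
illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Edit:
    op: str    # hypothetical op name, e.g. "MKDIR"
    path: str  # absolute path the edit refers to


def parent_of(path: str) -> str:
    """Return the parent directory of an absolute path ("/" for top-level entries)."""
    head = path.rstrip("/").rsplit("/", 1)[0]
    return head or "/"


def replay(edits: List[Edit], ask: Callable[[str], bool]) -> Set[str]:
    """Replay edits against an in-memory set of directories.

    When an edit references a parent directory that does not exist,
    call ask() with a question; create the parent if the answer is
    True, otherwise skip the edit.
    """
    dirs = {"/"}
    for edit in edits:
        parent = parent_of(edit.path)
        if parent not in dirs:
            question = f"Create missing parent directory {parent} for {edit.path}?"
            if not ask(question):
                continue  # operator declined: skip this edit
            # Create the parent and any missing ancestors.
            while parent not in dirs:
                dirs.add(parent)
                parent = parent_of(parent)
        if edit.op == "MKDIR":
            dirs.add(edit.path)
    return dirs
```

Passing `ask=lambda question: True` stands in for an operator who accepts every repair, and
`ask=lambda question: False` for one who skips every damaged edit.  The point is only that the
decision hook is independent of the on-disk log format, so new kinds of damage can get new prompts.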

Edit Log Toleration is not extensible.  It can only ever handle one type of corruption: tail
corruption.  But we rarely see tail corruption any more, since FSEditLog preallocation was
improved in branch-1 (HDFS-3596).  I can't think of a single case of tail corruption we've
seen in the past few months.  Many of the cases of corruption we've seen have been instances of
HDFS-3652, and edit log toleration is inherently useless for those cases.  Missing features can
be fixed; inherent uselessness cannot.
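
A small sketch of what "tail corruption only" means in practice.  This is Python written purely for
illustration, not the HDFS-3521 implementation; the function name `truncate_corrupt_tail` and the
convention that `None` marks an unreadable record are assumptions.

```python
from typing import List, Optional


def truncate_corrupt_tail(records: List[Optional[str]]) -> List[Optional[str]]:
    """Return the readable prefix of a log whose only damage is at the tail.

    None marks an unreadable record (an assumption of this sketch).
    If a readable record follows an unreadable one, the damage is not
    confined to the tail and this approach has nothing to offer.
    """
    # Index of the first unreadable record, or len(records) if there is none.
    first_bad = next(
        (i for i, record in enumerate(records) if record is None),
        len(records),
    )
    if any(record is not None for record in records[first_bad:]):
        raise ValueError("corruption is not confined to the tail")
    return records[:first_bad]
```

A log such as `["a", "b", None, None]` can be cut back to `["a", "b"]`, but `["a", None, "c"]`
raises, because dropping everything after the first bad record would also discard a good edit.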

And these are just the technical arguments.  There are many more convincing process-based arguments.
 branch-1 is a stable branch.  We should be fixing bugs, not making major changes.  We should
be trying to minimize the divergence between branch-1 and branch-2, not amplify it.  People
already know how to use recovery mode.  We're not going to retrain people to use an (in my
opinion more error-prone) system that does the same thing.

Let's just fix the bugs we have (I have pointed out some in this thread), get stuff working,
and focus our efforts on the future, not the past.
                
> Further improvement on recovery mode and edit log toleration in branch-1
> ------------------------------------------------------------------------
>
>                 Key: HDFS-3540
>                 URL: https://issues.apache.org/jira/browse/HDFS-3540
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 1.2.0
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: Tsz Wo (Nicholas), SZE
>
> *Recovery Mode*: HDFS-3479 backported HDFS-3335 to branch-1.  However, the recovery mode
feature in branch-1 is dramatically different from the recovery mode in trunk since the edit
log implementations in these two branches are different.  For example, there is UNCHECKED_REGION_LENGTH
in branch-1 but not in trunk.
> *Edit Log Toleration*: HDFS-3521 added this feature to branch-1 to remedy UNCHECKED_REGION_LENGTH
and to tolerate edit log corruption.
> There are overlaps between these two features.  We study potential further improvements
in this issue.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
