hadoop-hdfs-issues mailing list archives

From "Matt Foley (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-1955) HDFS-1826 made FSImage.doUpgrade() too fault-tolerant
Date Fri, 17 Jun 2011 07:35:47 GMT

     [ https://issues.apache.org/jira/browse/HDFS-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Matt Foley updated HDFS-1955:

    Attachment: hdfs-1955_1.patch

Here is a patch that provides the desired check, failing doUpgrade() if any storage directory
fails.  The change in FSImage is just a few lines, and easily validated by inspection. 
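
For readers without the patch at hand, the behaviour described above can be sketched roughly
as follows. This is an illustration only, not the contents of hdfs-1955_1.patch: the names
UpgradeSketch, DirUpgrader and upgradeAll are hypothetical, and collecting all failures before
throwing (rather than stopping at the first one) is an assumption.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UpgradeSketch {
    /** Hypothetical stand-in for the per-directory upgrade work. */
    interface DirUpgrader {
        void upgrade(String dir) throws IOException;
    }

    /**
     * Attempts the upgrade on every storage directory, remembers which ones
     * failed, and fails the whole operation if at least one of them failed.
     */
    static void upgradeAll(List<String> dirs, DirUpgrader upgrader) throws IOException {
        List<String> failed = new ArrayList<>();
        for (String dir : dirs) {
            try {
                upgrader.upgrade(dir);
            } catch (IOException e) {
                // Record the failure instead of silently dropping the directory.
                failed.add(dir + ": " + e.getMessage());
            }
        }
        if (!failed.isEmpty()) {
            throw new IOException("Upgrade failed for " + failed.size() + " of "
                    + dirs.size() + " storage directories: " + failed);
        }
    }

    public static void main(String[] args) {
        // Illustrative directory names; the second one is made to fail.
        List<String> dirs = Arrays.asList("/data/1/name", "/data/2/name");
        try {
            upgradeAll(dirs, dir -> {
                if (dir.startsWith("/data/2")) {
                    throw new IOException("simulated write failure");
                }
            });
            System.out.println("upgrade succeeded");
        } catch (IOException e) {
            System.out.println("upgrade aborted: " + e.getMessage());
        }
    }
}
```

The point of the sketch is only the final check: a single failed directory turns the whole
upgrade into a failure instead of a partial "success".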

However, providing a unit test for it was very difficult. The failure must be forced
*within* the doUpgrade() method itself, which is buried deep in the NameNode startup code
and well protected.  First, I tried making the storage directory read-only, but that is
caught in recoverTransitionRead() well before doUpgrade() is invoked.  Second, I looked at
using Mockito, but to spy on the startup/upgrade process one would apparently have to mock
the entire stack of HDFS system objects.  The invocation of NNStorage.rename() at line 367
of FSImage would be a convenient spy target, but it is static and I saw no way to get hold
of it.  Third, I considered and rejected adding test-only (non-mock) parameters to
production code.
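
To illustrate the trade-off (not as a proposal), here is a minimal sketch of the kind of
instance-level seam that would let a test force a rename failure where a static call cannot
be intercepted. It changes production code purely for testing, which is the category of
approach rejected above. Everything in it is hypothetical: RenameSeamSketch, Renamer,
Upgrader and moveCurrentToPrevious are invented names, and the paths are illustrative.

```java
import java.io.IOException;

public class RenameSeamSketch {
    /** Hypothetical seam: an instance-level rename that a test can replace. */
    interface Renamer {
        void rename(String from, String to) throws IOException;
    }

    /** Hypothetical upgrade step that receives its rename operation as a dependency. */
    static class Upgrader {
        private final Renamer renamer;

        Upgrader(Renamer renamer) {
            this.renamer = renamer;
        }

        void moveCurrentToPrevious(String dir) throws IOException {
            // Because the call goes through an instance, a test can substitute it.
            renamer.rename(dir + "/current", dir + "/previous.tmp");
        }
    }

    public static void main(String[] args) {
        // A test double that always fails, standing in for an injected fault.
        Renamer failing = (from, to) -> {
            throw new IOException("injected rename failure: " + from);
        };
        try {
            new Upgrader(failing).moveCurrentToPrevious("/data/1/name");
            System.out.println("no failure");
        } catch (IOException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```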

In the end, I tested it manually by temporarily hacking the code in doUpgrade() to force
the error.  This validated my patch, and along the way I also found and fixed an NPE bug
in FSEditLog.

> HDFS-1826 made FSImage.doUpgrade() too fault-tolerant
> -----------------------------------------------------
>                 Key: HDFS-1955
>                 URL: https://issues.apache.org/jira/browse/HDFS-1955
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.22.0, 0.23.0
>            Reporter: Matt Foley
>            Assignee: Matt Foley
>         Attachments: hdfs-1955_1.patch
>
> Prior to HDFS-1826, doUpgrade() would fail if any of the storage directories failed to
> successfully write the new fsimage or edits files.
> Now it appears to "succeed" even if some or all of the individual directories fail.
> There is some discussion about whether doUpgrade() should have some fault tolerance,
> but for now make it fail on any single storage directory failure, as before.

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

