hadoop-hdfs-issues mailing list archives

From "Eli Collins (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-2702) A single failed name dir can cause the NN to exit
Date Sun, 18 Dec 2011 23:02:30 GMT

     [ https://issues.apache.org/jira/browse/HDFS-2702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Collins updated HDFS-2702:

    Attachment: hdfs-2702.txt

Slightly updated patch. I made FSEditLog#logEdit throw an AssertionError (rather than just
assert) so that we stop the NN if there's a bug where we forget to remove an edit stream after
noticing a failed directory. This should never fire, but it could if we introduced a bug,
e.g. by missing a call to removeEdits. I also updated the test to check that we can't log an
edit if there are no streams.
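
The distinction between a plain assert and an explicit throw matters because Java skips assert statements unless the JVM is started with -ea. The following is a minimal sketch of that idea, not the actual FSEditLog code: the class EditLogSketch, its methods, and the use of StringBuilder as a stand-in for an edit stream are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an edit log; not the real FSEditLog.
class EditLogSketch {
    // Each StringBuilder stands in for one open edit output stream.
    private final List<StringBuilder> editStreams = new ArrayList<>();

    void addStream() {
        editStreams.add(new StringBuilder());
    }

    void removeAllStreams() {
        editStreams.clear();
    }

    int streamCount() {
        return editStreams.size();
    }

    // A bare `assert !editStreams.isEmpty();` would be skipped unless the JVM
    // runs with -ea, so the check is written as an explicit throw.
    void logEdit(String op) {
        if (editStreams.isEmpty()) {
            throw new AssertionError("No edit streams to log to");
        }
        for (StringBuilder stream : editStreams) {
            stream.append(op).append('\n');
        }
    }

    public static void main(String[] args) {
        EditLogSketch log = new EditLogSketch();
        log.addStream();
        log.logEdit("OP_MKDIR");      // succeeds: one stream is open
        log.removeAllStreams();
        try {
            log.logEdit("OP_DELETE"); // no streams left, so this must fail
        } catch (AssertionError e) {
            System.out.println("logEdit refused: " + e.getMessage());
        }
    }
}
```

In this sketch the failure is raised regardless of JVM flags, which is the property the updated test relies on when it checks that an edit cannot be logged with no streams.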
> A single failed name dir can cause the NN to exit 
> --------------------------------------------------
>                 Key: HDFS-2702
>                 URL: https://issues.apache.org/jira/browse/HDFS-2702
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 1.0.0
>            Reporter: Eli Collins
>            Assignee: Eli Collins
>            Priority: Critical
>         Attachments: hdfs-2702.txt, hdfs-2702.txt, hdfs-2702.txt
> There's a bug in FSEditLog#rollEditLog that causes the NN process to exit if a single
> name dir has failed. Here's the relevant code:
> {code}
> close();  // so editStreams.size() is 0
> foreach edits dir {
>   try {
>     ..
>     eStream = new ...  // might get an IOE here
>     editStreams.add(eStream);
>   } catch (IOException ioe) {
>     removeEditsForStorageDir(sd);  // exits if editStreams.size() <= 1
>   }
> }
> {code}
> If we get an IOException before we've added two edit streams to the list, we'll exit;
> e.g. if there's an error processing the 1st name dir, we'll exit even if there are 4 valid name
> dirs. The fix is to move the check out of removeEditsForStorageDir (née processIOError),
> or to modify it so the check can be disabled in some cases, e.g. here, where we don't yet know
> how many streams are valid.
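
One way to picture the first proposed fix (moving the check out of the removal path) is the sketch below. It is an illustration under stated assumptions, not the attached patch: the class RollSketch, the roll method, the failedDirs parameter used to simulate IOExceptions, and the IllegalStateException standing in for the NN exit are all hypothetical.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the roll logic; names do not match the real FSEditLog.
class RollSketch {
    // Returns the dirs for which a stream was opened. `failedDirs` simulates
    // directories whose stream creation throws an IOException.
    static List<String> roll(List<String> editsDirs, List<String> failedDirs) {
        List<String> editStreams = new ArrayList<>(); // empty, as after close()
        for (String dir : editsDirs) {
            try {
                if (failedDirs.contains(dir)) {
                    throw new IOException("cannot open edits in " + dir);
                }
                editStreams.add(dir);
            } catch (IOException ioe) {
                // Only record the failure here; the stream count is not
                // checked yet because later dirs may still be valid.
                System.err.println("Removing failed dir: " + ioe.getMessage());
            }
        }
        // Check once, after every dir has been tried.
        if (editStreams.isEmpty()) {
            throw new IllegalStateException("No valid edits dirs remain");
        }
        return editStreams;
    }

    public static void main(String[] args) {
        List<String> dirs =
            Arrays.asList("/name1", "/name2", "/name3", "/name4", "/name5");
        // The 1st dir fails, the other 4 are valid, and the roll still succeeds.
        System.out.println(roll(dirs, Arrays.asList("/name1")));
        // prints "[/name2, /name3, /name4, /name5]"
    }
}
```

Deferring the count check until the loop finishes means a failure on the first directory no longer looks like "one or fewer streams left"; the sketch only gives up when every directory has failed.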


