hadoop-hdfs-issues mailing list archives

From "Aaron T. Myers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-2305) Running multiple 2NNs can result in corrupt file system
Date Tue, 13 Sep 2011 01:25:10 GMT

    [ https://issues.apache.org/jira/browse/HDFS-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13103250#comment-13103250 ]

Aaron T. Myers commented on HDFS-2305:
--------------------------------------

bq. Some of the new info messages should probably be debug level

There were only a few new info messages. I changed one of them to debug, and made one other
less verbose, since some of the info is only relevant in the event of an error, and in that
case the extra info is printed as part of the exception.
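
For illustration, the shape of the change (hypothetical class and message; the real ones are in the attached patch):

{code:java}
// Hypothetical sketch of the log-level change described above; the actual
// messages and class live in the attached patch.
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

class CheckpointLogSketch {
  private static final Log LOG = LogFactory.getLog(CheckpointLogSketch.class);

  void onImageFetched(String host, long sizeBytes) {
    // Was LOG.info(...); demoted to debug since the detail only matters
    // when something goes wrong, and failures carry it in the exception.
    if (LOG.isDebugEnabled()) {
      LOG.debug("Fetched image from " + host + " (" + sizeBytes + " bytes)");
    }
  }
}
{code}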

bq. Do we also need to add some locking so that only one 2NN could be uploading an image at the same time?

Agreed. This isn't strictly necessary to fix the issue identified in this JIRA, but it is
a potential source of error as well.
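
Something like the following should suffice (a minimal sketch with hypothetical names, not the actual patch):

{code:java}
// Minimal sketch (hypothetical names): an NN-side guard so that at most one
// 2NN can be running the roll/upload sequence at any given time.
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

class CheckpointUploadGuard {
  private final AtomicBoolean uploadInProgress = new AtomicBoolean(false);

  /** Invoked when a 2NN asks to start delivering a new image. */
  void beginUpload() throws IOException {
    if (!uploadInProgress.compareAndSet(false, true)) {
      throw new IOException("Another 2NN is already uploading an image");
    }
  }

  /** Invoked once the upload completes or fails, releasing the guard. */
  void endUpload() {
    uploadInProgress.set(false);
  }
}
{code}

The same effect could be had with a synchronized method; the point is that a second 2NN's upload attempt fails fast instead of racing the first.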

bq. getNewChecksum looks like it will leak a file descriptor

Thanks, good catch.
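
For the record, the fix is the usual close-in-finally pattern; a sketch of its shape (hypothetical signature, since the real method is in the patch):

{code:java}
// Sketch of the close-in-finally fix (hypothetical signature; the real
// getNewChecksum is in the patch): the stream is closed even if reading
// or digesting throws, so the file descriptor can't leak.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ChecksumSketch {
  static byte[] getNewChecksum(File imageFile) throws IOException {
    MessageDigest digester;
    try {
      digester = MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException e) {
      throw new IOException("MD5 digest not available");
    }
    DigestInputStream in = null;
    try {
      in = new DigestInputStream(new FileInputStream(imageFile), digester);
      byte[] buf = new byte[4096];
      while (in.read(buf) != -1) {
        // reading through the stream updates the digest as a side effect
      }
      return digester.digest();
    } finally {
      if (in != null) {
        in.close(); // releases the underlying file descriptor
      }
    }
  }
}
{code}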

bq. would it be easier to just backport the part of 903 that creates an "imageChecksum" member which is updated whenever the image is merged, by the existing output stream? That would reduce divergence between 20s and trunk. That is to say, backport HDFS-903 except for the part where the checksum is put in the VERSION file.

I thought about doing this. Though it seems like it would make for a more straightforward
back-port, the back-port isn't easy regardless, because of other divergences between trunk
and branch-0.20-security. So we don't seem to gain much by doing it this way, and since we
wouldn't be storing the previous checksum as part of the VERSION file, we wouldn't get the
intended benefit of HDFS-903 ("NN should verify images and edit logs on startup").
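
For reference, the trunk approach under discussion looks roughly like this (a sketch with hypothetical names, not the actual trunk code):

{code:java}
// Rough sketch (hypothetical names) of the HDFS-903 style "imageChecksum"
// member: the image output stream is wrapped in a DigestOutputStream, so
// the checksum is refreshed every time a merged image is written out.
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.security.DigestOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ImageSaveSketch {
  private byte[] imageChecksum; // updated on every image save

  void saveImage(File outFile) throws IOException {
    MessageDigest digester;
    try {
      digester = MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException e) {
      throw new IOException("MD5 digest not available");
    }
    DataOutputStream out = new DataOutputStream(
        new DigestOutputStream(new FileOutputStream(outFile), digester));
    try {
      // ... write the merged namespace image through 'out' as usual ...
      out.writeInt(0); // placeholder for the real image contents
    } finally {
      out.close();
    }
    imageChecksum = digester.digest();
  }
}
{code}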

I'll upload a patch in a moment that addresses all of these issues except the last one.
Todd, if you feel strongly about it, I can rework the patch as you described to be a more
faithful back-port of HDFS-903.

> Running multiple 2NNs can result in corrupt file system
> -------------------------------------------------------
>
>                 Key: HDFS-2305
>                 URL: https://issues.apache.org/jira/browse/HDFS-2305
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.20.2
>            Reporter: Aaron T. Myers
>            Assignee: Aaron T. Myers
>         Attachments: hdfs-2305-test.patch, hdfs-2305.0.patch, hdfs-2305.1.patch
>
>
> Here's the scenario:
> * You run the NN and 2NN (2NN A) on the same machine.
> * You don't have the address of the 2NN configured, so it's defaulting to 127.0.0.1.
> * There's another 2NN (2NN B) running on a second machine.
> * When a 2NN is done checkpointing, it says "hey NN, I have an updated fsimage for you. You can download it from this URL, which includes my IP address, which is x"
> And here are the steps that occur to cause this issue:
> # Some edits happen.
> # 2NN A (on the NN machine) does a checkpoint. All is dandy.
> # Some more edits happen.
> # 2NN B (on a different machine) does a checkpoint. It tells the NN "grab the newly-merged fsimage file from 127.0.0.1"
> # NN happily grabs the fsimage from 2NN A (the 2NN on the NN machine), which is stale.
> # NN renames edits.new file to edits. At this point the in-memory FS state is fine, but the on-disk state is missing edits.
> # The next time a 2NN (any 2NN) tries to do a checkpoint, it gets an up-to-date edits file, with an outdated fsimage, and tries to apply those edits to that fsimage.
> # Kaboom.
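
A practical workaround for the misconfiguration described above is to pin each 2NN's advertised address explicitly rather than relying on the default. Assuming the 0.20-era property name dfs.secondary.http.address (worth verifying for your release), on each 2NN host:

{code:xml}
<!-- hdfs-site.xml on the 2NN host: advertise a reachable address instead
     of relying on the 127.0.0.1 default described above. The property
     name here is the 0.20-era one; verify it for your release. -->
<property>
  <name>dfs.secondary.http.address</name>
  <value>2nn-b.example.com:50090</value>
</property>
{code}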

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
