hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Aaron T. Myers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-2305) Running multiple 2NNs can result in corrupt file system
Date Wed, 31 Aug 2011 23:05:10 GMT

    [ https://issues.apache.org/jira/browse/HDFS-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094979#comment-13094979
] 

Aaron T. Myers commented on HDFS-2305:
--------------------------------------

@Todd - yep, I agree. Option 3 as-described is a subset of HDFS-903.

Perhaps, then, the thing to do for this JIRA is to do a partial back-port HDFS-903 to the
security 0.20-security branch in such a way so as to not require a change to the layout version.
Since the goal of this JIRA is just to prevent receiving the wrong fsimage during checkpointing,
not verify validity of the fsimage coming off disk, there's no need to store the checkpoint
in the VERSION file. Rather, we can just compute the checksum on the fly, but still send it
with the CheckpointSignature during checkpointing.

> Running multiple 2NNs can result in corrupt file system
> -------------------------------------------------------
>
>                 Key: HDFS-2305
>                 URL: https://issues.apache.org/jira/browse/HDFS-2305
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.20.2
>            Reporter: Aaron T. Myers
>            Assignee: Aaron T. Myers
>
> Here's the scenario:
> * You run the NN and 2NN (2NN A) on the same machine.
> * You don't have the address of the 2NN configured, so it's defaulting to 127.0.0.1.
> * There's another 2NN (2NN B) running on a second machine.
> * When a 2NN is done checkpointing, it says "hey NN, I have an updated fsimage for you.
You can download it from this URL, which includes my IP address, which is x"
> And here's the steps that occur to cause this issue:
> # Some edits happen.
> # 2NN A (on the NN machine) does a checkpoint. All is dandy.
> # Some more edits happen.
> # 2NN B (on a different machine) does a checkpoint. It tells the NN "grab the newly-merged
fsimage file from 127.0.0.1"
> # NN happily grabs the fsimage from 2NN A (the 2NN on the NN machine), which is stale.
> # NN renames edits.new file to edits. At this point the in-memory FS state is fine, but
the on-disk state is missing edits.
> # The next time a 2NN (any 2NN) tries to do a checkpoint, it gets an up-to-date edits
file, with an outdated fsimage, and tries to apply those edits to that fsimage.
> # Kaboom.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message