hadoop-hdfs-issues mailing list archives

From "dhruba borthakur (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-550) DataNode restarts may introduce corrupt/duplicated/lost replicas when handling detached replicas
Date Wed, 09 Sep 2009 15:56:57 GMT

    [ https://issues.apache.org/jira/browse/HDFS-550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753134#action_12753134 ]

dhruba borthakur commented on HDFS-550:

> So the recovery adds duplicate replicas of the same block id to a datanode disk .......

This is true. This is harmless because they will get deleted in the next block report, isn't it?

> at least DataNode should treat the detached directory as a tmp directory and removes
its content at Startup time...... (B)

I completely agree. The reason this is done is that FileUtil.renameFile() is not atomic
on Windows. Hence it is better to have issue A (as marked above) than to lose blocks on
the Windows platform.
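
To make the atomicity concern concrete, here is a minimal sketch of a replace that asks
for an atomic move and falls back to a plain move when the file system cannot provide one.
It uses the JDK's java.nio.file API rather than the FileUtil.renameFile() mentioned above,
whose code is not shown in this thread; the class and method names (ReplaceSketch, replace)
are hypothetical and not part of HDFS.

```java
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration only; not HDFS code.
final class ReplaceSketch {
    /**
     * Moves src over dst. Returns true if the move was requested and performed
     * as an atomic operation, false if the non-atomic fallback was used.
     */
    static boolean replace(Path src, Path dst) throws IOException {
        try {
            // Ask the file system for an atomic rename.
            Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE);
            return true;
        } catch (AtomicMoveNotSupportedException e) {
            // Fallback: not atomic, so a crash in the middle may leave dst missing.
            // This is the window in which a second copy elsewhere protects the block.
            Files.move(src, dst, StandardCopyOption.REPLACE_EXISTING);
            return false;
        }
    }
}
```

When only the fallback path is available, keeping the extra copy until the move has
finished trades a possible duplicate replica for not losing the block, which is the
trade-off described above.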

 > My unit tests test only the new recovery works or not.

I am fine with the code restructuring that you have done. +1. However, if this is supposed
to fix a bug in the current code, isn't it a good idea to first write a unit test that triggers
the problem and then demonstrate that the unit test passes with the code in the patch?

> DataNode restarts may introduce corrupt/duplicated/lost replicas when handling detached replicas
> ------------------------------------------------------------------------------------------------
>                 Key: HDFS-550
>                 URL: https://issues.apache.org/jira/browse/HDFS-550
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: data-node
>    Affects Versions: 0.21.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>            Priority: Blocker
>             Fix For: Append Branch
>         Attachments: detach.patch
> Current trunk first calls detach to unlink a finalized replica before appending to this
block. Unlink is done by temporarily copying the block file in the "current" subtree to a directory
called "detach" under the volume's data directory and then copying it back when unlink succeeds.
On datanode restarts, datanodes recover failed unlinks by copying replicas under "detach" to
> There are two bugs with this implementation:
> 1. The "detach" directory is not included in a snapshot, so rollback will cause the
"detaching" replicas to be lost.
> 2. After a replica is copied to the "detach" directory, the information about its original
location is lost. The current implementation erroneously assumes that the replica to be unlinked
is under "current". This can cause two replicas with the same block id to coexist
in a datanode. Also, if a replica under "detach" is corrupt, the corrupt replica is moved to
"current" without being detected, polluting datanode data.
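
The duplicate-replica case in bug 2 can be reproduced in isolation with a small sketch.
It assumes one reading of the description: the original replica lives in a subdirectory
of "current", and recovery copies the detached file to the top of "current". The class,
the method names, and the file names used with it (such as subdir0 and blk_1) are
hypothetical stand-ins; this is not the DataNode code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical model of the recovery path, for illustration only.
final class DetachRecoverySketch {
    /** Naive recovery: copy every file under detach/ to the top of current/, then delete it. */
    static void naiveRecover(Path detach, Path current) throws IOException {
        List<Path> detached;
        try (Stream<Path> files = Files.list(detach)) {
            detached = files.collect(Collectors.toList());
        }
        for (Path f : detached) {
            // The original location is unknown here, so the top of current/ is assumed.
            Files.copy(f, current.resolve(f.getFileName()), StandardCopyOption.REPLACE_EXISTING);
            Files.delete(f);
        }
    }

    /** Returns every regular file named blockFileName anywhere under current/. */
    static List<Path> findReplicas(Path current, String blockFileName) throws IOException {
        try (Stream<Path> walk = Files.walk(current)) {
            return walk.filter(Files::isRegularFile)
                       .filter(p -> p.getFileName().toString().equals(blockFileName))
                       .collect(Collectors.toList());
        }
    }
}
```

With a replica at current/subdir0/blk_1 and a leftover copy at detach/blk_1, calling
naiveRecover and then findReplicas(current, "blk_1") returns two paths. A check of this
shape is one way to write a test that triggers the problem before the patch is applied.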

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
