hadoop-hdfs-issues mailing list archives

From "Wei-Chiu Chuang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-11022) DataNode unable to remove corrupt block replica due to race condition
Date Mon, 17 Oct 2016 22:20:58 GMT

     [ https://issues.apache.org/jira/browse/HDFS-11022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei-Chiu Chuang updated HDFS-11022:
    Attachment: HDFS-11022.png

Attached a diagram in case it makes the scenario easier to understand.

> DataNode unable to remove corrupt block replica due to race condition
> ---------------------------------------------------------------------
>                 Key: HDFS-11022
>                 URL: https://issues.apache.org/jira/browse/HDFS-11022
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, namenode
>    Affects Versions: 2.6.0
>         Environment: CDH5.7.0
>            Reporter: Wei-Chiu Chuang
>            Priority: Critical
>         Attachments: HDFS-11022.png
> Scenario:
> # A client reads a replica blk_A_x from a data node and detects corruption.
> # In the meantime, the replica is appended to, updating its generation stamp from x to y.
> # The client tells NN to mark the replica blk_A_x corrupt.
> # NN tells the data node to (1) delete replica blk_A_x and (2) replicate the newer replica
blk_A_y from another datanode. Due to block placement policy, blk_A_y is replicated to the
same node. (It's a small cluster)
> # DN is unable to receive the newer replica blk_A_y, because the replica already exists.
> # DN is also unable to delete replica blk_A_x because blk_A_x no longer exists (the append turned it into blk_A_y).
> # The replica on the DN is not part of the data pipeline, so it becomes stale.
> If another replica becomes corrupt and the NameNode wants to replicate a healthy replica
to this DataNode, it can't, because the stale replica still exists. Because this is a small cluster,
soon enough (within about an hour) no DataNode is able to receive a healthy replica.
> This cluster also suffers from HDFS-11019, so even though the DataNode later detected the data
corruption, it was unable to report it to the NameNode.
> Note that we are still investigating the root cause of the corruption.
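
The scenario quoted above can be sketched as a toy model. This is not HDFS code: the class ToyDataNode, its methods, and the generation stamp values are all hypothetical, and the two rules it encodes (a delete must match the generation stamp exactly, and a receive is rejected when any replica of the block already exists) are assumptions taken from the description rather than from the DataNode source.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the race described in HDFS-11022. All names are hypothetical. */
public class StaleReplicaSketch {

    /** Holds at most one replica per block ID, tagged with a generation stamp. */
    static class ToyDataNode {
        // block ID -> generation stamp of the replica stored locally
        private final Map<String, Long> replicas = new HashMap<>();

        void store(String blockId, long genStamp) {
            replicas.put(blockId, genStamp);
        }

        /** An append bumps the generation stamp of the local replica. */
        void append(String blockId, long newGenStamp) {
            if (!replicas.containsKey(blockId)) {
                throw new IllegalStateException("no replica of " + blockId);
            }
            replicas.put(blockId, newGenStamp);
        }

        /** Assumed rule: delete only a replica with exactly this generation stamp. */
        boolean delete(String blockId, long genStamp) {
            Long held = replicas.get(blockId);
            if (held == null || held != genStamp) {
                return false; // the requested replica does not exist here
            }
            replicas.remove(blockId);
            return true;
        }

        /** Assumed rule: refuse a transfer if any replica of the block exists. */
        boolean receive(String blockId, long genStamp) {
            if (replicas.containsKey(blockId)) {
                return false; // "replica already exists"
            }
            replicas.put(blockId, genStamp);
            return true;
        }

        /** Returns the generation stamp held for the block, or null if none. */
        Long heldGenStamp(String blockId) {
            return replicas.get(blockId);
        }
    }

    public static void main(String[] args) {
        final long x = 1001L; // generation stamp seen by the reading client
        final long y = 1002L; // generation stamp after the append

        ToyDataNode dn = new ToyDataNode();
        dn.store("blk_A", x);

        // Step 1 (client reads blk_A_x and detects corruption) is not modeled.
        dn.append("blk_A", y); // step 2: the append races with the corruption report

        // Steps 3-4: the NameNode acts on the report for the old generation stamp.
        boolean deleted = dn.delete("blk_A", x);   // step 6: blk_A_x is gone
        boolean received = dn.receive("blk_A", y); // step 5: blk_A_y already exists

        System.out.println("delete blk_A_x succeeded: " + deleted);
        System.out.println("receive blk_A_y succeeded: " + received);
        System.out.println("replica left on DN: blk_A_" + dn.heldGenStamp("blk_A"));
    }
}
```

In this model both commands fail and the replica with generation stamp y stays on the node, which corresponds to the stale replica in the last step of the scenario. A delete issued with the current generation stamp would succeed, after which the node could accept a healthy copy.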

This message was sent by Atlassian JIRA

