hadoop-hdfs-issues mailing list archives

From "Suresh Srinivas (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HDFS-3875) Issue handling checksum errors in write pipeline
Date Mon, 14 Jan 2013 22:18:13 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13553193#comment-13553193 ]

Suresh Srinivas edited comment on HDFS-3875 at 1/14/13 10:16 PM:
-----------------------------------------------------------------

Kihwal, here is how I understand the new behavior. Correct me if I am wrong. In the following
scenarios, the client is writing in a pipeline to datanodes d1, d2 and d3. At each point in the
pipeline, the data is marked as corrupt or not corrupt.

client(not corrupt) d1(not corrupt) d2(not corrupt) d3(corrupt)
* d3 detects the corruption and reports a CHECKSUM_ERROR ACK to d2
* d2 does not verify the checksum, hence its own status is SUCCESS, but it receives CHECKSUM_ERROR and shuts down
* d1 does not verify the checksum. Its status is SUCCESS + MIRROR_ERROR.

Only d1 is considered to be a valid copy, even though d2 may not be corrupt.

client(not corrupt) d1(not corrupt) d2(corrupt) d3(corrupt)
* d3 detects the corruption and reports a CHECKSUM_ERROR ACK to d2
* d2 does not verify the checksum, hence its own status is SUCCESS, but it receives CHECKSUM_ERROR and shuts down
* d1 does not verify the checksum. Its status is SUCCESS + MIRROR_ERROR.

Only d1 is considered to be a valid copy.

client(not corrupt) d1(corrupt) d2(corrupt) d3(corrupt)
* d3 detects the corruption and reports a CHECKSUM_ERROR ACK to d2
* d2 does not verify the checksum, hence its own status is SUCCESS, but it receives CHECKSUM_ERROR and shuts down
* _d1 does not verify the checksum. Its status is SUCCESS + MIRROR_ERROR._

d1 is still considered a valid copy. Is this correct?

client(corrupt) d1(corrupt) d2(corrupt) d3(corrupt)
* d3 detects the corruption and reports a CHECKSUM_ERROR ACK to d2
* d2 does not verify the checksum, hence its own status is SUCCESS, but it receives CHECKSUM_ERROR and shuts down
* d1 does not verify the checksum. Its status is SUCCESS + MIRROR_ERROR.

d1 is still considered a valid copy.

In all the above cases, whether a node detects a checksum error itself or a downstream node
detects it, the result appears the same to the upstream nodes (as a mirror error). Is that what
you intended?
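
To make the question concrete, the four scenarios above can be sketched as a small Python model. This is only an illustration of the behavior as I have described it, not code from the patch: the function names are hypothetical, the status strings copy the wording used in this comment rather than any Hadoop constants, and the no-error branch (everything reports SUCCESS when d3 sees clean data) is my own assumption.

```python
# Status strings follow the wording of this comment, not Hadoop's constants.
SUCCESS = "SUCCESS"
CHECKSUM_ERROR = "CHECKSUM_ERROR"
MIRROR_ERROR = "MIRROR_ERROR"

NODES = ("d1", "d2", "d3")


def simulate_pipeline(corrupt):
    """Model one write through client -> d1 -> d2 -> d3.

    corrupt maps each name to True if the data at that point is corrupt.
    Only d3, the last node, verifies the checksum in this model.
    Returns (statuses, valid_nodes).
    """
    if corrupt["d3"]:
        statuses = {
            "d3": [CHECKSUM_ERROR],         # detects the corruption itself
            "d2": [SUCCESS],                # own status; shuts down on d3's ACK
            "d1": [SUCCESS, MIRROR_ERROR],  # sees only a failed mirror
        }
        valid_nodes = ["d1"]
    else:
        # Assumption: with clean data at d3, nothing is reported.
        statuses = {name: [SUCCESS] for name in NODES}
        valid_nodes = list(NODES)
    return statuses, valid_nodes


def undetected_corruption(corrupt):
    """Replicas that are still treated as valid but actually hold corrupt data."""
    _, valid_nodes = simulate_pipeline(corrupt)
    return [name for name in valid_nodes if corrupt[name]]


# The four scenarios from this comment, in order.
SCENARIOS = [
    {"client": False, "d1": False, "d2": False, "d3": True},
    {"client": False, "d1": False, "d2": True, "d3": True},
    {"client": False, "d1": True, "d2": True, "d3": True},
    {"client": True, "d1": True, "d2": True, "d3": True},
]

if __name__ == "__main__":
    for number, corrupt in enumerate(SCENARIOS, start=1):
        statuses, valid_nodes = simulate_pipeline(corrupt)
        print(number, statuses, valid_nodes, undetected_corruption(corrupt))
```

In this model the statuses and the surviving replica are identical in all four scenarios, so nothing distinguishes the first two cases (d1 is clean) from the last two (d1 is kept although it is corrupt).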

                
                  
> Issue handling checksum errors in write pipeline
> ------------------------------------------------
>
>                 Key: HDFS-3875
>                 URL: https://issues.apache.org/jira/browse/HDFS-3875
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs-client
>    Affects Versions: 2.0.2-alpha
>            Reporter: Todd Lipcon
>            Assignee: Kihwal Lee
>            Priority: Critical
>         Attachments: hdfs-3875.branch-0.23.no.test.patch.txt, hdfs-3875.branch-0.23.with.test.patch.txt,
>                      hdfs-3875.trunk.no.test.patch.txt, hdfs-3875.trunk.no.test.patch.txt, hdfs-3875.trunk.patch.txt,
>                      hdfs-3875.trunk.patch.txt, hdfs-3875.trunk.with.test.patch.txt, hdfs-3875.trunk.with.test.patch.txt,
>                      hdfs-3875-wip.patch
>
>
> We saw this issue with one block in a large test cluster. The client is storing the data
> with replication level 2, and we saw the following:
> - the second node in the pipeline detects a checksum error on the data it received from
>   the first node. We don't know if the client sent a bad checksum, or if it got corrupted between
>   node 1 and node 2 in the pipeline.
> - this caused the second node to get kicked out of the pipeline, since it threw an exception.
>   The pipeline started up again with only one replica (the first node in the pipeline)
> - this replica was later determined to be corrupt by the block scanner, and unrecoverable
>   since it is the only replica

