hadoop-hdfs-issues mailing list archives

From "Yongjun Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline
Date Mon, 25 Aug 2014 06:20:59 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108798#comment-14108798 ]

Yongjun Zhang commented on HDFS-3875:
-------------------------------------

Hi [~kihwal], I filed HDFS-6937 to track the similar issue I'm seeing, so we can continue
the discussion there. Thanks.




> Issue handling checksum errors in write pipeline
> ------------------------------------------------
>
>                 Key: HDFS-3875
>                 URL: https://issues.apache.org/jira/browse/HDFS-3875
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs-client
>    Affects Versions: 2.0.2-alpha
>            Reporter: Todd Lipcon
>            Assignee: Kihwal Lee
>            Priority: Critical
>             Fix For: 3.0.0, 2.1.0-beta, 0.23.8
>
>         Attachments: hdfs-3875-wip.patch, hdfs-3875.branch-0.23.no.test.patch.txt, hdfs-3875.branch-0.23.patch.txt, hdfs-3875.branch-0.23.patch.txt, hdfs-3875.branch-0.23.with.test.patch.txt, hdfs-3875.branch-2.patch.txt, hdfs-3875.patch.txt, hdfs-3875.patch.txt, hdfs-3875.patch.txt, hdfs-3875.trunk.no.test.patch.txt, hdfs-3875.trunk.no.test.patch.txt, hdfs-3875.trunk.patch.txt, hdfs-3875.trunk.patch.txt, hdfs-3875.trunk.with.test.patch.txt, hdfs-3875.trunk.with.test.patch.txt
>
>
> We saw this issue with one block in a large test cluster. The client was storing the data with replication level 2, and we saw the following:
> - The second node in the pipeline detected a checksum error on the data it received from the first node. We don't know whether the client sent a bad checksum or the data was corrupted between node 1 and node 2 in the pipeline.
> - Because it threw an exception, the second node was kicked out of the pipeline. The pipeline started up again with only one replica (the first node in the pipeline).
> - This replica was later determined to be corrupt by the block scanner, and it was unrecoverable since it was the only replica.
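
The ambiguity in the report above can be illustrated with a small toy model. This is only a sketch, not HDFS code: the class `PipelineChecksumSketch`, the `Packet` type, and the helpers `checksum`, `corrupt`, and `node2DetectsError` are all hypothetical names, and CRC32 stands in for whatever checksum the pipeline actually uses. The model assumes that only the last node in the pipeline verifies what it receives, which matches the report in that the error surfaced at the second node.

```java
import java.util.zip.CRC32;

/** Toy model of a two-node write pipeline. Hypothetical names; not HDFS code. */
public class PipelineChecksumSketch {

    /** CRC32 over a chunk, standing in for the checksum sent with the data. */
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    /** A packet as forwarded along the pipeline: payload plus the sender's checksum. */
    static final class Packet {
        final byte[] data;
        final long sentChecksum;

        Packet(byte[] data, long sentChecksum) {
            this.data = data.clone();
            this.sentChecksum = sentChecksum;
        }

        /** True if the payload still matches the checksum that travelled with it. */
        boolean verifies() {
            return checksum(data) == sentChecksum;
        }
    }

    /** Returns a copy of the packet with one payload bit flipped (payload must be non-empty). */
    static Packet corrupt(Packet p) {
        byte[] copy = p.data.clone();
        copy[0] ^= 0x01;
        return new Packet(copy, p.sentChecksum);
    }

    /**
     * What node 2 observes on receipt: true means "checksum error detected".
     * Node 2 sees the same result whether the packet was already bad when it
     * reached node 1 or was damaged on the hop from node 1 to node 2.
     */
    static boolean node2DetectsError(Packet received) {
        return !received.verifies();
    }
}
```

In this model, passing `corrupt(good)` to `node2DetectsError` returns true in both cases: when the corruption happened before node 1 stored the data, and when it happened on the hop from node 1 to node 2. Node 2's observation alone cannot distinguish them, yet only in the first case is the replica left on node 1 bad. Dropping node 2 and continuing with node 1 is therefore unsafe whenever the first case is possible.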



--
This message was sent by Atlassian JIRA
(v6.2#6252)
