hadoop-hdfs-issues mailing list archives

From "Suresh Srinivas (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline
Date Tue, 15 Jan 2013 01:15:12 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13553370#comment-13553370 ]

Suresh Srinivas commented on HDFS-3875:
---------------------------------------

Had an offline conversation with Kihwal. Here is one of the above scenarios in more detail
(thanks, Kihwal, for explaining the current behavior).

Client (not corrupt) → d1 (not corrupt) → d2 (not corrupt) → d3 (corrupt), where d3 for
some reason sees only corrupt data.
* d3 detects the corruption and reports a CHECKSUM_ERROR ACK to d2. The packet is not
written to disk on d3.
* d2 does not verify the checksum, so its own status is SUCCESS, but it receives the
CHECKSUM_ERROR and shuts down.
* d1 does not verify the checksum. Its status is {SUCCESS, MIRROR_ERROR}.
* The client re-establishes the pipeline with d1 and d3 and sends the packet again.
* d3 detects the corruption again and reports a CHECKSUM_ERROR ACK to d1. The packet is
not written to disk on d3.
* d1 does not verify the checksum, so its status is SUCCESS, but it receives the
CHECKSUM_ERROR and shuts down.
* The client re-establishes the pipeline with only d3 and sends the packet again.
* d3 detects the corruption again and reports a CHECKSUM_ERROR ACK upstream to the
client. The packet is not written to disk on d3.
* The client fails to write the packet and abandons writing the file?
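The steps above can be sketched as a toy simulation (not HDFS code; the node names and the recover helper are illustrative). It shows how the node that detects the corruption survives each recovery round while its healthy upstream neighbor, which merely received the CHECKSUM_ERROR ack, is the one dropped:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the current recovery behavior described above (not HDFS
// code): the datanode that *detects* corruption stays in the pipeline,
// while its upstream neighbor, which only saw a CHECKSUM_ERROR ack,
// shuts down and is dropped.
public class PipelineRecoverySim {
    static List<String> recover(List<String> pipeline, String corruptNode) {
        List<String> p = new ArrayList<>(pipeline);
        while (p.contains(corruptNode)) {
            int i = p.indexOf(corruptNode);
            if (i == 0) {
                // corrupt node is now the head of the pipeline: its ack
                // goes to the client and the write fails outright
                return p;
            }
            // upstream neighbor receives CHECKSUM_ERROR and shuts down
            System.out.println("dropping healthy node " + p.get(i - 1));
            p.remove(i - 1);
        }
        return p;
    }

    public static void main(String[] args) {
        List<String> result = recover(List.of("d1", "d2", "d3"), "d3");
        System.out.println("survivor: " + result);  // survivor: [d3]
    }
}
```

Running it drops d2, then d1, leaving only the corrupt d3 — exactly the pathology in the walkthrough.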

The current behavior repeatedly keeps the node that sees the corruption (or is itself
corrupting the data) through pipeline recovery (d3 above), while the nodes that saw no
corruption get dropped from the pipeline. This could be avoided if a datanode verified
the checksum of its own copy whenever a downstream datanode reports a checksum error.
With that change, recovery would rebuild the pipeline up to the point where the
corruption was introduced.
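A minimal sketch of that idea, assuming a hypothetical onDownstreamChecksumError hook (not an actual HDFS API) and plain CRC32 standing in for HDFS's per-chunk block checksums:

```java
import java.util.zip.CRC32;

// Hypothetical sketch of the suggested change: when a downstream node
// acks CHECKSUM_ERROR, re-verify the local copy of the packet before
// deciding which side of the pipeline is at fault. Method and enum
// names are illustrative, not real HDFS identifiers.
public class ChecksumOnAck {
    enum PacketStatus { SUCCESS, CHECKSUM_ERROR }

    static boolean verifyChecksum(byte[] data, long expected) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue() == expected;
    }

    static PacketStatus onDownstreamChecksumError(byte[] data, long expected) {
        if (verifyChecksum(data, expected)) {
            // our copy is clean: the corruption happened at or after the
            // mirror, so drop the downstream node and stay in the pipeline
            return PacketStatus.SUCCESS;
        }
        // our copy is corrupt too: propagate CHECKSUM_ERROR upstream so
        // recovery truncates the pipeline at the true point of corruption
        return PacketStatus.CHECKSUM_ERROR;
    }

    public static void main(String[] args) {
        byte[] pkt = "packet-data".getBytes();
        CRC32 crc = new CRC32();
        crc.update(pkt);
        long good = crc.getValue();
        System.out.println(onDownstreamChecksumError(pkt, good));
        System.out.println(onDownstreamChecksumError(pkt, good + 1));
    }
}
```

With this check, a healthy d1 or d2 would answer SUCCESS instead of shutting down, and the corrupt node would be the one excluded during recovery.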

Kihwal, please add comments if I missed something.
                
> Issue handling checksum errors in write pipeline
> ------------------------------------------------
>
>                 Key: HDFS-3875
>                 URL: https://issues.apache.org/jira/browse/HDFS-3875
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs-client
>    Affects Versions: 2.0.2-alpha
>            Reporter: Todd Lipcon
>            Assignee: Kihwal Lee
>            Priority: Critical
>         Attachments: hdfs-3875.branch-0.23.no.test.patch.txt, hdfs-3875.branch-0.23.with.test.patch.txt,
> hdfs-3875.trunk.no.test.patch.txt, hdfs-3875.trunk.no.test.patch.txt, hdfs-3875.trunk.patch.txt,
> hdfs-3875.trunk.patch.txt, hdfs-3875.trunk.with.test.patch.txt, hdfs-3875.trunk.with.test.patch.txt,
> hdfs-3875-wip.patch
>
>
> We saw this issue with one block in a large test cluster. The client is storing the data
> with replication level 2, and we saw the following:
> - the second node in the pipeline detects a checksum error on the data it received from
> the first node. We don't know if the client sent a bad checksum, or if it got corrupted
> between node 1 and node 2 in the pipeline.
> - this caused the second node to get kicked out of the pipeline, since it threw an
> exception. The pipeline started up again with only one replica (the first node in the
> pipeline)
> - this replica was later determined to be corrupt by the block scanner, and unrecoverable
> since it is the only replica

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
