hadoop-hdfs-issues mailing list archives

From "Yongjun Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline
Date Fri, 22 Aug 2014 17:12:14 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107107#comment-14107107 ]

Yongjun Zhang commented on HDFS-3875:
-------------------------------------

Hi [~kihwal],

Thanks for your earlier work on this issue. We are seeing a similar problem even though
we have this patch applied. One question about the patch:

Assume we have a pipeline of three DNs: DN1, DN2, and DN3. DN3 detects a checksum error
and reports back to DN2. DN2 then truncates its replica to the acknowledged size by
calling {{static private void truncateBlock(File blockFile, File metaFile,}}, which reads the
data from the local replica file, calculates the checksum for the truncated length,
and writes that checksum back to the meta file.

My question is: when writing the checksum back to the meta file, this method doesn't check
the newly computed checksum against the one already stored to see if they match. By contrast, DN3
does check its computed checksum against the checksum sent from upstream in the pipeline when
reporting the checksum mismatch. If something went wrong on DN2 in the truncateBlock method
(say, the existing data on disk is corrupted), then DN2 ends up with an incorrect checksum
without being aware of it. Later, when we try to recover the pipeline and use DN2's replica as
the source, any new DN that receives data from DN2 will always find a checksum error.
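To make the suggestion concrete, here is a minimal, self-contained sketch (not the actual HDFS {{truncateBlock}} code; the class name, 512-byte chunk size, and use of plain CRC32 are illustrative) of the extra verification step: before overwriting the meta file, recompute the per-chunk checksums of the data being kept and compare them against the checksums already stored, so a corrupt local replica is detected instead of silently re-signed.

```java
import java.util.zip.CRC32;

public class TruncateCheck {
    // Bytes covered by one checksum; 512 matches the HDFS default.
    static final int CHUNK = 512;

    // Recompute per-chunk CRC32 checksums for data[0..newLen).
    static long[] computeChecksums(byte[] data, int newLen) {
        int chunks = (newLen + CHUNK - 1) / CHUNK;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            CRC32 crc = new CRC32();
            int off = i * CHUNK;
            crc.update(data, off, Math.min(CHUNK, newLen - off));
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // The proposed check: every full chunk that survives the truncation
    // must still match the checksum previously stored in the meta file.
    // (The last chunk may be partial after truncation, so its checksum
    // legitimately changes and is excluded from the comparison.)
    static boolean verifyAgainstStored(long[] recomputed, long[] stored) {
        int fullChunks = Math.min(recomputed.length - 1, stored.length);
        for (int i = 0; i < fullChunks; i++) {
            if (recomputed[i] != stored[i]) {
                return false; // local replica corrupt: don't use as source
            }
        }
        return true;
    }
}
```

With a check like this, a DN2 whose on-disk data is already corrupt would fail the truncation step immediately, rather than becoming the source replica for pipeline recovery.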

This is my speculation so far. Do you think this is a possibility? 

Thanks a lot.



> Issue handling checksum errors in write pipeline
> ------------------------------------------------
>
>                 Key: HDFS-3875
>                 URL: https://issues.apache.org/jira/browse/HDFS-3875
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs-client
>    Affects Versions: 2.0.2-alpha
>            Reporter: Todd Lipcon
>            Assignee: Kihwal Lee
>            Priority: Critical
>             Fix For: 3.0.0, 2.1.0-beta, 0.23.8
>
>         Attachments: hdfs-3875-wip.patch, hdfs-3875.branch-0.23.no.test.patch.txt, hdfs-3875.branch-0.23.patch.txt,
hdfs-3875.branch-0.23.patch.txt, hdfs-3875.branch-0.23.with.test.patch.txt, hdfs-3875.branch-2.patch.txt,
hdfs-3875.patch.txt, hdfs-3875.patch.txt, hdfs-3875.patch.txt, hdfs-3875.trunk.no.test.patch.txt,
hdfs-3875.trunk.no.test.patch.txt, hdfs-3875.trunk.patch.txt, hdfs-3875.trunk.patch.txt, hdfs-3875.trunk.with.test.patch.txt,
hdfs-3875.trunk.with.test.patch.txt
>
>
> We saw this issue with one block in a large test cluster. The client is storing the data
with replication level 2, and we saw the following:
> - the second node in the pipeline detects a checksum error on the data it received from
the first node. We don't know if the client sent a bad checksum, or if it got corrupted between
node 1 and node 2 in the pipeline.
> - this caused the second node to get kicked out of the pipeline, since it threw an exception.
The pipeline started up again with only one replica (the first node in the pipeline)
> - this replica was later determined to be corrupt by the block scanner, and unrecoverable
since it is the only replica



--
This message was sent by Atlassian JIRA
(v6.2#6252)
