hadoop-hdfs-issues mailing list archives

From "Yongjun Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-11160) VolumeScanner reports write-in-progress replicas as corrupt incorrectly
Date Thu, 01 Dec 2016 02:44:58 GMT

     [ https://issues.apache.org/jira/browse/HDFS-11160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Yongjun Zhang updated HDFS-11160:
    Attachment: HDFS-11160.003.patch

Hi [~jojochuang],

Thanks for your work here. I reviewed your patch.

While the optimization discussion is still ongoing, I focused on the implementation. I don't
think BlockSender should be aware of FsVolumeImpl, because that looks like an abstraction
violation.

I changed the implementation to address this and uploaded patch rev 003. Basically, I think
FinalizedReplica can offer an API similar to the one in the RBW replica for getting the last
partial checksum.

A possible optimization is to skip this when the visibleLength falls on a chunk boundary,
since there is then no partial chunk (I have not added this change).
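
To make the chunk-boundary idea concrete, here is a minimal sketch. It is not code from the patch: the class PartialChunk and both method names are hypothetical, and the values 512 (bytes per checksum) and 4 (checksum size, as for CRC32) are assumptions used only for illustration.

```java
public final class PartialChunk {
    private PartialChunk() {}

    /** True when the visible length ends exactly on a checksum chunk boundary. */
    static boolean endsOnChunkBoundary(long visibleLength, int bytesPerChecksum) {
        return visibleLength % bytesPerChecksum == 0;
    }

    /**
     * Offset of the checksum covering the last chunk, counted from the start
     * of the checksum data (any file header is deliberately left out here).
     */
    static long lastChecksumOffset(long visibleLength, int bytesPerChecksum,
                                   int checksumSize) {
        if (visibleLength <= 0) {
            throw new IllegalArgumentException("visibleLength must be positive");
        }
        // Round up: a trailing partial chunk still has its own checksum.
        long numChunks = (visibleLength + bytesPerChecksum - 1) / bytesPerChecksum;
        return (numChunks - 1) * checksumSize;
    }

    public static void main(String[] args) {
        int bytesPerChecksum = 512; // assumed value
        int checksumSize = 4;       // assumed value (CRC32-sized checksum)

        // 1024 bytes = two full chunks, so there is no partial chunk to re-read.
        System.out.println(endsOnChunkBoundary(1024, bytesPerChecksum)); // true
        // 1100 bytes = two full chunks plus a 76-byte partial chunk.
        System.out.println(endsOnChunkBoundary(1100, bytesPerChecksum)); // false
        // The partial chunk is the third chunk, so its checksum starts at 2 * 4.
        System.out.println(lastChecksumOffset(1100, bytesPerChecksum, checksumSize)); // 8
    }
}
```

Only the second case would need the last partial checksum to be fetched; when the length is on a boundary, every checksum on disk covers a full chunk.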

I have not gone through the test code yet.

Please take a look at what I changed. I hope it makes sense to you.


> VolumeScanner reports write-in-progress replicas as corrupt incorrectly
> -----------------------------------------------------------------------
>                 Key: HDFS-11160
>                 URL: https://issues.apache.org/jira/browse/HDFS-11160
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>         Environment: CDH5.7.4
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>         Attachments: HDFS-11160.001.patch, HDFS-11160.002.patch, HDFS-11160.003.patch,
> Due to a race condition initially reported in HDFS-6804, VolumeScanner may erroneously
> detect good replicas as corrupt. This is serious because in some cases it results in data
> loss if all replicas are declared corrupt. This bug is especially prominent when there are
> a lot of append requests via HttpFs/WebHDFS.
> We are investigating an incident that caused a very high block corruption rate in a relatively
> small cluster. Initially, we thought HDFS-11056 was to blame. However, after applying HDFS-11056,
> we are still seeing VolumeScanner report corrupt replicas.
> It turns out that if a replica is being appended to while VolumeScanner is scanning it,
> VolumeScanner may compare the new checksum against old data, causing a checksum mismatch.
> I have a unit test to reproduce the error. Will attach later. A quick and simple fix
> is to hold the FsDatasetImpl lock and read the checksum from disk.
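
The mismatch described above can be illustrated with a small, self-contained sketch. This is not the HDFS code path: AppendRaceSketch, its fields, and its methods are hypothetical stand-ins (an in-memory byte array for the partial chunk on disk, a CRC32 value for the last checksum in the meta file), and the append is interleaved by hand rather than by a real second thread so the outcome is deterministic.

```java
import java.util.Arrays;
import java.util.zip.CRC32;

public final class AppendRaceSketch {
    /** Data and checksum captured together. */
    static final class Snapshot {
        final byte[] data;
        final long checksum;
        Snapshot(byte[] data, long checksum) {
            this.data = data;
            this.checksum = checksum;
        }
    }

    private final Object lock = new Object();
    private byte[] data = new byte[0];        // stand-in for the partial chunk on disk
    private long storedChecksum = crc(data);  // stand-in for the last stored checksum

    static long crc(byte[] bytes) {
        CRC32 c = new CRC32();
        c.update(bytes, 0, bytes.length);
        return c.getValue();
    }

    /** Appends bytes and rewrites the checksum of the (still partial) chunk. */
    void append(byte[] more) {
        synchronized (lock) {
            byte[] grown = Arrays.copyOf(data, data.length + more.length);
            System.arraycopy(more, 0, grown, data.length, more.length);
            data = grown;
            storedChecksum = crc(grown);
        }
    }

    byte[] readData() {
        synchronized (lock) { return data.clone(); }
    }

    long readChecksum() {
        synchronized (lock) { return storedChecksum; }
    }

    /** Reads data and checksum under one lock acquisition, so they always agree. */
    Snapshot snapshot() {
        synchronized (lock) { return new Snapshot(data.clone(), storedChecksum); }
    }

    public static void main(String[] args) {
        AppendRaceSketch replica = new AppendRaceSketch();
        replica.append("hello".getBytes());

        // Racy scanner: reads the data, an append slips in, then reads the checksum.
        byte[] oldData = replica.readData();
        replica.append(" world".getBytes());
        long newChecksum = replica.readChecksum();
        System.out.println("racy read matches: " + (crc(oldData) == newChecksum));

        // Scanner that takes both values under the same lock.
        Snapshot s = replica.snapshot();
        System.out.println("locked snapshot matches: " + (crc(s.data) == s.checksum));
    }
}
```

The first comparison fails even though nothing is corrupt, which is the false positive described in the report; the second shows why reading the checksum while holding the lock avoids it.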

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org