hadoop-hdfs-issues mailing list archives

From "Wei-Chiu Chuang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception
Date Sat, 17 Aug 2019 11:04:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909643#comment-16909643 ]

Wei-Chiu Chuang commented on HDFS-13709:
----------------------------------------

Thanks [~zhangchen].
Can you help verify that the failed tests are unrelated?
Additionally, it would be great if you could add a few javadoc comments for the new handleBadBlock()
method. Its logic can be a little convoluted given that there are two asynchronous threads
involved (datanode and volume scanner). We definitely want to avoid a situation where the volume scanner
finds a suspect and calls handleBadBlock(), and then the suspect is put into the volume scanner's
queue and gets scanned again and again, non-stop.
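
To make the concern concrete, here is a minimal sketch of one way to keep a suspect from being handled over and over. It is not the code in the HDFS-13709 patch: the class BadBlockGuard and its methods are hypothetical names, and the sketch assumes a block can be identified by a long block ID and that something calls clear() once the replica is gone.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical guard, NOT part of HDFS: remembers which block IDs have
 * already been handled so that a second report of the same block, from
 * either the datanode thread or the volume scanner thread, is ignored.
 */
final class BadBlockGuard {
  private final Set<Long> handled = new HashSet<>();

  /** Returns true only the first time a given block ID is offered. */
  synchronized boolean shouldHandle(long blockId) {
    // Set.add returns false when the element was already present.
    return handled.add(blockId);
  }

  /** Forgets a block, for example after its replica is removed (assumed hook). */
  synchronized void clear(long blockId) {
    handled.remove(blockId);
  }
}
```

Under these assumptions, a caller would check shouldHandle(blockId) before queueing the suspect again, which breaks the scan, report, re-queue cycle described above. Both methods are synchronized because two threads may report the same block concurrently.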

Nit:
{code:java}
assertTrue(replicaCount == 1);
{code}
It is better to use
{code:java}
assertEquals("error message", 1, replicaCount);
{code}
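
The practical difference shows up in the failure output. The sketch below does not use JUnit itself: assertTrue and assertEquals here are simplified, hypothetical stand-ins that mimic how JUnit 4 builds its failure messages, and failureMessage is a helper invented for the demonstration.

```java
/** Illustration only: simplified stand-ins for the two JUnit 4 assertions. */
final class AssertMessageDemo {

  /** Stand-in for assertTrue(boolean): fails without any detail. */
  static void assertTrue(boolean condition) {
    if (!condition) {
      throw new AssertionError();
    }
  }

  /** Stand-in for assertEquals(String, long, long): reports both values. */
  static void assertEquals(String message, long expected, long actual) {
    if (expected != actual) {
      throw new AssertionError(
          message + " expected:<" + expected + "> but was:<" + actual + ">");
    }
  }

  /** Hypothetical helper: runs a check and returns its failure text. */
  static String failureMessage(Runnable check) {
    try {
      check.run();
      return "passed";
    } catch (AssertionError e) {
      // getMessage() is null when the error was created without a message.
      return String.valueOf(e.getMessage());
    }
  }
}
```

With a replica count of 2, the assertTrue form fails with no message at all, while the assertEquals form reports the expected value, the actual value, and the custom text, which is usually enough to diagnose the failure from the test log alone.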

> Report bad block to NN when transfer block encounter EIO exception
> ------------------------------------------------------------------
>
>                 Key: HDFS-13709
>                 URL: https://issues.apache.org/jira/browse/HDFS-13709
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>            Reporter: Chen Zhang
>            Assignee: Chen Zhang
>            Priority: Major
>         Attachments: HDFS-13709.002.patch, HDFS-13709.003.patch, HDFS-13709.004.patch, HDFS-13709.patch
>
>
> In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes a bad disk
> track may cause data loss.
> For example, suppose there are 3 replicas on 3 machines A/B/C. If a bad track occurs in A's replica
> data and someday B and C crash at the same time, NN will try to replicate the data from A and
> fail. The block is now corrupt but no one knows, because NN thinks there is at least 1 healthy
> replica and keeps trying to replicate it.
> When reading a replica that has data on a bad track, the OS returns an EIO error. If the
> DN reports the bad block as soon as it gets an EIO, we can find this case ASAP and try to
> avoid data loss.
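
As a rough illustration of the detection step described above, the sketch below classifies an IOException as a probable EIO. It is an assumption-laden heuristic, not the logic in the attached patches: Java exposes no errno value, so the sketch matches the message text "Input/output error" that Linux typically produces for EIO, and the class and method names are hypothetical.

```java
import java.io.IOException;

/** Hypothetical helper, NOT the HDFS-13709 implementation. */
final class EioCheck {

  // Text of strerror(EIO) as commonly seen on Linux; treated here as an
  // assumption, since other platforms or locales may word it differently.
  private static final String EIO_TEXT = "Input/output error";

  /** Returns true if the exception message suggests a disk-level EIO. */
  static boolean looksLikeEio(IOException e) {
    String message = e.getMessage();
    return message != null && message.contains(EIO_TEXT);
  }
}
```

A caller that catches an IOException while reading replica data could use such a check to decide whether to report the block, while leaving network errors such as a broken pipe alone.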



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


