hadoop-hdfs-issues mailing list archives

From "Konstantin Shvachko (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-1371) One bad node can incorrectly flag many files as corrupt
Date Sat, 04 Sep 2010 01:05:35 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906197#action_12906197 ]

Konstantin Shvachko commented on HDFS-1371:
-------------------------------------------

We discussed this with Rob and Nicholas.
- We should leave the notification logic in place as is.
If there is any suspicion that some data is bad, it is better that the NN knows about
it right away, rather than first going through an additional verification procedure
with the data-nodes. Bad clients are rare, and we should optimize for the regular case.
- We should still do something about bad clients marking good blocks as corrupt.
The proposal is to add verification logic to the name-node.
When the NN finds that *all* replicas of a block are corrupt, it requests the respective
data-nodes to verify their replicas. The DNs verify and either confirm the corruption
or repair the replica state on the name-node.
- The NN should not worry until all replicas are corrupt, since the general replication
logic should recover the block.
- This minimizes changes and utilizes the existing replica verification and restoration
procedures.
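
The proposed flow can be sketched as name-node-side bookkeeping: track which data-nodes
have had a replica of a block reported corrupt, and only trigger data-node re-verification
once every replica has been flagged. This is an illustrative sketch only, under the
assumptions above; the class and method names (CorruptReplicaTracker, reportCorrupt, etc.)
are hypothetical, not actual HDFS code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the proposed name-node logic; names are
// illustrative and do not correspond to real HDFS classes.
public class CorruptReplicaTracker {
    private final int replication;
    // blockId -> data-nodes whose replica of the block was reported corrupt
    private final Map<Long, Set<String>> corruptReports = new HashMap<>();

    public CorruptReplicaTracker(int replication) {
        this.replication = replication;
    }

    // A client reports the replica on `node` as corrupt; the NN records
    // the report right away (the notification logic stays as is).
    public void reportCorrupt(long blockId, String node) {
        corruptReports.computeIfAbsent(blockId, k -> new HashSet<>()).add(node);
    }

    // Per the proposal: only when *all* replicas have been flagged does the
    // NN ask the data-nodes to re-verify, instead of trusting the reports.
    // Below that threshold, normal replication recovers the block.
    public boolean needsDataNodeVerification(long blockId) {
        Set<String> reports = corruptReports.get(blockId);
        return reports != null && reports.size() >= replication;
    }

    // A data-node re-verified its replica and found it healthy:
    // repair the replica state on the name-node by clearing the report.
    public void clearReport(long blockId, String node) {
        Set<String> reports = corruptReports.get(blockId);
        if (reports != null) {
            reports.remove(node);
            if (reports.isEmpty()) corruptReports.remove(blockId);
        }
    }
}
```

In the HDFS-1371 scenario, one bad client flagging all three replicas of a block would
now cause a re-verification round instead of an immediate corrupt mark, so healthy
replicas clear the flag rather than leaving fsck reporting false corruption.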

> One bad node can incorrectly flag many files as corrupt
> -------------------------------------------------------
>
>                 Key: HDFS-1371
>                 URL: https://issues.apache.org/jira/browse/HDFS-1371
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs client, name-node
>    Affects Versions: 0.20.1
>         Environment: yahoo internal version 
> [knoguchi@gwgd4003 ~]$ hadoop version
> Hadoop 0.20.104.3.1007030707
>            Reporter: Koji Noguchi
>
> On our cluster, 12 files were reported as corrupt by fsck even though the replicas
> on the datanodes were healthy.
> Turns out that all the replicas (12 files x 3 replicas per file) were reported
> corrupt from one node.
> Surprisingly, these files were still readable/accessible from dfsclient (-get/-cat)
> without any problems.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

