hadoop-hdfs-issues mailing list archives

From "Tanping Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-1371) One bad node can incorrectly flag many files as corrupt
Date Fri, 15 Apr 2011 19:10:06 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13020400#comment-13020400

Tanping Wang commented on HDFS-1371:

Just to summarize, we now have multiple options to solve this problem:
1) After the DFSClient detects a bad block replica on one datanode, DN1, it reads from the next
datanode, DN2.  If the client ever reads one good replica of the block, it reports to the NN
that DN1 has a bad replica of the block.  If the client cannot successfully read any replica
of the block, it reports nothing to the NN; if all replicas of the block are bad, there is
nothing we can do to recover anyway.  This is a simple change confined to the DFSClient.
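The client-side logic of option 1 can be sketched as follows. This is a minimal illustration, not the actual DFSClient code; the names `ReplicaReader`, `readReplica`, and `readAndReport` are hypothetical stand-ins for the real read path and NN reporting call.

```java
import java.util.ArrayList;
import java.util.List;

public class Option1Sketch {
    // Hypothetical stand-in for reading one replica; true = checksum OK.
    interface ReplicaReader {
        boolean readReplica(String datanode);
    }

    // Try each datanode in order, collecting nodes whose replica failed the
    // checksum.  Only if some replica reads cleanly do we report the bad
    // nodes (they would go to the NN); if every replica is bad, report
    // nothing.  Returns the list of datanodes that would be reported.
    static List<String> readAndReport(List<String> datanodes, ReplicaReader reader) {
        List<String> badNodes = new ArrayList<>();
        boolean gotGoodReplica = false;
        for (String dn : datanodes) {
            if (reader.readReplica(dn)) {
                gotGoodReplica = true;
                break;              // got a good copy; stop reading
            }
            badNodes.add(dn);       // checksum failure on this node
        }
        if (!gotGoodReplica) {
            return new ArrayList<>();   // all replicas bad: stay silent
        }
        return badNodes;            // these would be reported to the NN
    }

    public static void main(String[] args) {
        // DN1 bad, DN2 good: only DN1 is flagged.
        System.out.println(readAndReport(
            List.of("DN1", "DN2", "DN3"), dn -> !dn.equals("DN1")));  // [DN1]
        // Every replica bad: nothing is flagged.
        System.out.println(readAndReport(
            List.of("DN1", "DN2", "DN3"), dn -> false));              // []
    }
}
```

The key property is the second case: a client that cannot read any replica never marks the block corrupt, so one bad client (or one bad network path) cannot flag healthy files.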

2) After the DFSClient detects a bad block replica, it reports back to that DN directly by
sending an OP_STATUS_CHECKSUM_ERROR message.  The DN puts these blocks at the head of the
block scanner's queue for verification.  If the replica is bad, it is repaired the same way
the block scanner repairs it today.  This way no traffic is driven to the NN.  The logic
changes are in the block scanner, plus an acknowledgement message between the DN and the client.
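The DN-side change in option 2 amounts to promoting a client-flagged block to the front of the scan queue. A minimal sketch, assuming a simple deque in place of the real block scanner's data structures (the class and method names here are illustrative, not actual DataNode APIs):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class Option2Sketch {
    // Pending-verification queue of the (simulated) block scanner.
    private final Deque<String> scanQueue = new ArrayDeque<>();

    // Normal periodic scanning: blocks are appended in order.
    void enqueueForPeriodicScan(String blockId) {
        scanQueue.addLast(blockId);
    }

    // Called when the DN receives OP_STATUS_CHECKSUM_ERROR from a client:
    // the suspect block jumps to the head of the queue so it is re-verified
    // (and repaired if genuinely bad) before the routine backlog.
    void onChecksumErrorReport(String blockId) {
        scanQueue.remove(blockId);      // drop any already-queued copy
        scanQueue.addFirst(blockId);
    }

    String nextBlockToScan() {
        return scanQueue.pollFirst();
    }

    public static void main(String[] args) {
        Option2Sketch scanner = new Option2Sketch();
        scanner.enqueueForPeriodicScan("blk_1");
        scanner.enqueueForPeriodicScan("blk_2");
        scanner.onChecksumErrorReport("blk_2");       // client flagged blk_2
        System.out.println(scanner.nextBlockToScan()); // blk_2
    }
}
```

Because verification happens on the DN that actually holds the replica, a mistaken client report costs only one extra scan there and never touches the NN.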

3) The client reports to the NN; once the NN finds that *ALL* replicas are bad, the NN asks
the DNs to verify.  One drawback is that a bad client can keep reporting to the NN, which
could overwork the NN.
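The NN-side tally that option 3 implies can be sketched like this. It is only an illustration of the "act when all replicas are reported bad" rule, under the assumption of a hypothetical `reportCorrupt` entry point; the real NN block map and report handling are more involved.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class Option3Sketch {
    // blockId -> datanodes that clients have reported corrupt so far
    private final Map<String, Set<String>> corruptReports = new HashMap<>();
    // blockId -> full set of replica locations (normally from the block map)
    private final Map<String, Set<String>> replicaLocations;

    Option3Sketch(Map<String, Set<String>> replicaLocations) {
        this.replicaLocations = replicaLocations;
    }

    // Record one client report.  Returns true only once every replica of
    // the block has been reported corrupt, i.e. the point at which the NN
    // would ask the DNs to verify the block.
    boolean reportCorrupt(String blockId, String datanode) {
        corruptReports.computeIfAbsent(blockId, b -> new HashSet<>()).add(datanode);
        return corruptReports.get(blockId)
                             .containsAll(replicaLocations.get(blockId));
    }

    public static void main(String[] args) {
        Option3Sketch nn = new Option3Sketch(
            Map.of("blk_1", Set.of("DN1", "DN2", "DN3")));
        System.out.println(nn.reportCorrupt("blk_1", "DN1"));  // false
        System.out.println(nn.reportCorrupt("blk_1", "DN2"));  // false
        System.out.println(nn.reportCorrupt("blk_1", "DN3"));  // true
    }
}
```

Note the drawback described above: every call lands on the NN, so a misbehaving client can drive this path repeatedly, which is exactly the load concern that options 1 and 2 avoid.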

> One bad node can incorrectly flag many files as corrupt
> -------------------------------------------------------
>                 Key: HDFS-1371
>                 URL: https://issues.apache.org/jira/browse/HDFS-1371
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs client, name-node
>    Affects Versions: 0.20.1
>         Environment: yahoo internal version 
> [knoguchi@gwgd4003 ~]$ hadoop version
> Hadoop
>            Reporter: Koji Noguchi
>            Assignee: Tanping Wang
> On our cluster, 12 files were reported as corrupt by fsck even though the replicas on
> the datanodes were healthy.
> Turns out that all the replicas (12 files x 3 replicas per file) were reported corrupt
> from one node.
> Surprisingly, these files were still readable/accessible from dfsclient (-get/-cat) without
> any problems.

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
