hadoop-hdfs-issues mailing list archives

From "Tanping Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-1371) One bad node can incorrectly flag many files as corrupt
Date Sat, 16 Apr 2011 01:37:06 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13020512#comment-13020512 ]

Tanping Wang commented on HDFS-1371:
------------------------------------

After a discussion and thoughtful debate with multiple people, we concluded that we will
take the solution Koji suggested:

1) With a replication factor greater than 1, if the DFS client detects corruption of a block
replica, it continues trying to read another replica from another datanode. If the DFS client
can read at least one good replica of the block, it reports all the bad block replicas, with
their datanode information, to the namenode. If the DFS client cannot read even one good
replica, it does not report anything to the namenode.

2) With a replication factor of 1, the DFS client reports back to the namenode if it detects
a block corruption.

This change happens only on the client side. The existing logic remains the same on the
server end.
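
For illustration, here is a minimal Java sketch of the proposed client-side reading policy.
The class and method names (readAndVerifyReplica, reportBadBlocksToNameNode, etc.) are
hypothetical stand-ins for this comment, not the actual DFSClient internals:

{code}
import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch of the proposed client-side policy; all names here
 * are illustrative and do not match the real DFSClient code.
 */
class CorruptReplicaReportingSketch {

    /** A (block, datanode) pair whose replica failed verification. */
    static class BadReplica {
        final String blockId;
        final String datanode;
        BadReplica(String blockId, String datanode) {
            this.blockId = blockId;
            this.datanode = datanode;
        }
    }

    /** Stub: fetch one replica and verify its checksum. */
    private boolean readAndVerifyReplica(String blockId, String datanode) {
        return false; // placeholder
    }

    /** Stub: RPC reporting corrupt replicas to the namenode. */
    private void reportBadBlocksToNameNode(List<BadReplica> bad) {
        // placeholder
    }

    /**
     * Try each datanode holding the block. Corrupt replicas are
     * reported only if at least one good replica was read (point 1),
     * or if the block has a single replica (point 2). A client that
     * cannot read any of several replicas may itself be faulty, so
     * it stays silent rather than flag every replica as corrupt.
     */
    boolean readBlock(String blockId, List<String> datanodes) {
        List<BadReplica> bad = new ArrayList<>();
        boolean gotGoodReplica = false;
        for (String dn : datanodes) {
            if (readAndVerifyReplica(blockId, dn)) {
                gotGoodReplica = true;
                break; // one good replica satisfies the read
            }
            bad.add(new BadReplica(blockId, dn));
        }
        if (!bad.isEmpty() && (gotGoodReplica || datanodes.size() == 1)) {
            reportBadBlocksToNameNode(bad);
        }
        return gotGoodReplica;
    }
}
{code}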

We take this approach because:
1) We consider a bad client to be a client with good intentions but handicapped by some
physical difficulty (such as a memory problem), not a malicious client.
2) If a client cannot read even one good replica, it could be a handicapped client. In the
worst-case scenario, if all replicas of the block are corrupted, no repair work can be done
even if this is reported back to the namenode. Moreover, based on Koji's experience, there
has never been a case in our production environment over the past years where all replicas
of a block were corrupted.
3) A handicapped client is extremely rare. We do not want to put heavy verification logic
on the namenode/datanode end, nor do we want a protocol change just to verify blocks for
this extremely rare case.

With that said, changing only the client reading logic takes care of handicapped clients.
It is a lightweight solution with no need for a protocol change.

> One bad node can incorrectly flag many files as corrupt
> -------------------------------------------------------
>
>                 Key: HDFS-1371
>                 URL: https://issues.apache.org/jira/browse/HDFS-1371
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs client, name-node
>    Affects Versions: 0.20.1
>         Environment: yahoo internal version 
> [knoguchi@gwgd4003 ~]$ hadoop version
> Hadoop 0.20.104.3.1007030707
>            Reporter: Koji Noguchi
>            Assignee: Tanping Wang
>
> On our cluster, 12 files were reported as corrupt by fsck even though the replicas on
> the datanodes were healthy.
> Turns out that all the replicas (12 files x 3 replicas per file) were reported corrupt
> from one node.
> Surprisingly, these files were still readable/accessible from dfsclient (-get/-cat) without
> any problems.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
