hadoop-hdfs-issues mailing list archives

From "Matt Foley (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-1855) TestDatanodeBlockScanner.testBlockCorruptionRecoveryPolicy() part 2 fails in two different ways
Date Thu, 21 Apr 2011 07:24:06 GMT

     [ https://issues.apache.org/jira/browse/HDFS-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Matt Foley updated HDFS-1855:

    Attachment: TestDatanodeBlockScanner_bug_v1.patch

The method blockCorruptionRecoveryPolicy() creates 5 datanodes, 3 of which hold replicas of
a certain block.  Two of those replicas, on the nodes at index [0] and [1], are deliberately
corrupted.  The method then attempts to restart those two nodes so that the corruption will be detected.

The loop that is intended to restart both datanodes starts with [0].  But when [0] is
restarted, it is removed from the MiniCluster's arraylist and re-added at the end, so the
former [1] moves to [0].  The loop then restarts the new [1], which was the former [2] and
doesn't contain a corrupt replica.  As a result, the corrupt replica in the former [1] never
gets detected.
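
The index shift can be reproduced without a cluster. The following is a minimal sketch, not the actual test or MiniDFSCluster code: the class name, the restart() helper, and the node labels dn0..dn4 are hypothetical, and a restart is assumed to do nothing more than remove the list entry and re-add it at the end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RestartIndexShiftDemo {
    // Hypothetical stand-in for a datanode restart: the entry at index i
    // is removed from the list and re-added at the end.
    static void restart(List<String> nodes, int i, List<String> restarted) {
        String node = nodes.remove(i);
        nodes.add(node);
        restarted.add(node);
    }

    // Restart indices 0 .. count-1 in ascending order and report which
    // nodes were actually restarted.
    static List<String> restartAscending(int count) {
        List<String> nodes =
            new ArrayList<>(Arrays.asList("dn0", "dn1", "dn2", "dn3", "dn4"));
        List<String> restarted = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            restart(nodes, i, restarted);
        }
        return restarted;
    }

    public static void main(String[] args) {
        // dn1 (a corrupted node) is skipped; dn2 is restarted instead.
        System.out.println(restartAscending(2)); // prints [dn0, dn2]
    }
}
```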

While the corruption is being resolved, one of two errors can happen, each with probability
50%:  Since the namenode thinks it still has two good replicas, it may pick the corrupt replica
as the source for re-replication.  That will cause a checksum error at the receiving node.

Alternatively, it may pick the one valid replica as the source, and replicate it, and delete
the bad replica from the original [0].  However, since it doesn't know that the replica on
the former [1] is corrupt, it never issues the delete request.  This causes the test case
to time out on the wait for corrupt replica deletion.

This problem is resolved by looping from high [1] to low [0], as is done in certain MiniDFSCluster
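
Why the high-to-low order helps can be sketched with the same kind of list model. This is an illustration under assumptions, not the contents of the attached patch: the class name, the restart() helper, and the labels dn0..dn4 are hypothetical, and a restart is again assumed to remove the entry and re-add it at the end. Removing a higher index never shifts the entries below it, so both intended nodes are reached.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RestartDescendingDemo {
    // Hypothetical stand-in for a datanode restart: the entry at index i
    // is removed from the list and re-added at the end.
    static void restart(List<String> nodes, int i, List<String> restarted) {
        String node = nodes.remove(i);
        nodes.add(node);
        restarted.add(node);
    }

    // Restart indices count-1 .. 0 in descending order and report which
    // nodes were actually restarted.
    static List<String> restartDescending(int count) {
        List<String> nodes =
            new ArrayList<>(Arrays.asList("dn0", "dn1", "dn2", "dn3", "dn4"));
        List<String> restarted = new ArrayList<>();
        for (int i = count - 1; i >= 0; i--) {
            restart(nodes, i, restarted);
        }
        return restarted;
    }

    public static void main(String[] args) {
        // Both corrupted nodes are restarted.
        System.out.println(restartDescending(2)); // prints [dn1, dn0]
    }
}
```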

> TestDatanodeBlockScanner.testBlockCorruptionRecoveryPolicy() part 2 fails in two different
> -----------------------------------------------------------------------------------------------
>                 Key: HDFS-1855
>                 URL: https://issues.apache.org/jira/browse/HDFS-1855
>             Project: Hadoop HDFS
>          Issue Type: Test
>          Components: test
>    Affects Versions: 0.22.0
>            Reporter: Matt Foley
>            Assignee: Matt Foley
>             Fix For: 0.22.0, 0.23.0
>         Attachments: TestDatanodeBlockScanner_bug_v1.patch
> The second part of test case TestDatanodeBlockScanner.testBlockCorruptionRecoveryPolicy(),
> "corrupt replica recovery for two corrupt replicas", always fails, half the time with a checksum
> error upon block replication, and half the time by timing out upon failure to delete the second
> corrupt replica.
