hadoop-hdfs-issues mailing list archives

From "Matt Foley (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-1855) TestDatanodeBlockScanner.testBlockCorruptionRecoveryPolicy() part 2 fails in two different ways
Date Thu, 21 Apr 2011 07:24:06 GMT

     [ https://issues.apache.org/jira/browse/HDFS-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Foley updated HDFS-1855:
-----------------------------

    Attachment: TestDatanodeBlockScanner_bug_v1.patch

In method blockCorruptionRecoveryPolicy(), 5 datanodes are created, 3 of which hold replicas
of a certain block.  Two of those replicas, on the nodes at index [0] and [1], are deliberately
corrupted.  The test then attempts to restart those two nodes so that the corruption will be
detected.

The loop that is intended to restart both datanodes starts with [0].  But restarting [0]
removes that datanode from the MiniDFSCluster's ArrayList and re-adds it at the end, so the
former [1] moves to [0].  The loop then restarts the new [1], which was the former [2] and
doesn't contain a corrupt replica.  As a result, the corrupt replica on the former [1] never
gets detected.
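
The index shift described above can be reproduced with a plain ArrayList.  This is only a
minimal sketch, not the MiniDFSCluster code: the class ForwardRestartDemo, the restart() helper,
and the names dn0..dn4 are hypothetical stand-ins, and "restart" is modeled as nothing more than
remove-at-index followed by append.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ForwardRestartDemo {
    // Hypothetical stand-in for a datanode restart: the entry at index i
    // is removed from the list and re-added at the end.
    static String restart(List<String> nodes, int i) {
        String dn = nodes.remove(i);
        nodes.add(dn);
        return dn;
    }

    // Restarts indexes 0..count-1 in ascending order and reports which
    // nodes were actually restarted.
    static List<String> restartForward(List<String> nodes, int count) {
        List<String> restarted = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            restarted.add(restart(nodes, i));
        }
        return restarted;
    }

    public static void main(String[] args) {
        // dn0 and dn1 play the role of the nodes with corrupt replicas.
        List<String> nodes =
            new ArrayList<>(Arrays.asList("dn0", "dn1", "dn2", "dn3", "dn4"));
        System.out.println(restartForward(nodes, 2)); // prints [dn0, dn2]
        System.out.println(nodes);                    // prints [dn1, dn3, dn4, dn0, dn2]
    }
}
```

In this model dn1 is never restarted, which matches the undetected corrupt replica on the
former [1].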

When the namenode tries to resolve the corruption, one of two errors occurs, each with
probability 50%.  Since the namenode thinks it still has two good replicas, it may pick the
undetected corrupt replica as the source for re-replication.  That causes a checksum error at
the receiving node.

Alternatively, it may pick the one valid replica as the source, replicate it, and delete
the bad replica from the original [0].  However, since it doesn't know that the replica on
the former [1] is corrupt, it never issues a delete request for that replica.  This causes
the test case to time out while waiting for corrupt replica deletion.

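This problem is resolved by looping from high [1] to low [0], as is done in certain MiniDFSCluster
methods.  Removing and re-adding an entry only shifts the entries above it, so when the loop
runs in descending order, [0] still refers to the intended datanode.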
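
The effect of reversing the loop can be checked with a small list model.  Again this is a sketch
under assumptions rather than the attached patch: ReverseRestartDemo, restart(), and the names
dn0..dn4 are hypothetical, and a restart is modeled as remove-at-index followed by append.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReverseRestartDemo {
    // Hypothetical stand-in for a datanode restart: the entry at index i
    // is removed from the list and re-added at the end.
    static String restart(List<String> nodes, int i) {
        String dn = nodes.remove(i);
        nodes.add(dn);
        return dn;
    }

    // Restarts indexes count-1 down to 0.  Each removal only shifts the
    // entries above index i, so the lower indexes still point at the
    // nodes they pointed at before the loop started.
    static List<String> restartReverse(List<String> nodes, int count) {
        List<String> restarted = new ArrayList<>();
        for (int i = count - 1; i >= 0; i--) {
            restarted.add(restart(nodes, i));
        }
        return restarted;
    }

    public static void main(String[] args) {
        List<String> nodes =
            new ArrayList<>(Arrays.asList("dn0", "dn1", "dn2", "dn3", "dn4"));
        System.out.println(restartReverse(nodes, 2)); // prints [dn1, dn0]
        System.out.println(nodes);                    // prints [dn2, dn3, dn4, dn1, dn0]
    }
}
```

Both dn0 and dn1 are restarted in this model, so both corrupt replicas would get a chance to be
detected.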

> TestDatanodeBlockScanner.testBlockCorruptionRecoveryPolicy() part 2 fails in two different ways
> -----------------------------------------------------------------------------------------------
>
>                 Key: HDFS-1855
>                 URL: https://issues.apache.org/jira/browse/HDFS-1855
>             Project: Hadoop HDFS
>          Issue Type: Test
>          Components: test
>    Affects Versions: 0.22.0
>            Reporter: Matt Foley
>            Assignee: Matt Foley
>             Fix For: 0.22.0, 0.23.0
>
>         Attachments: TestDatanodeBlockScanner_bug_v1.patch
>
>
> The second part of test case TestDatanodeBlockScanner.testBlockCorruptionRecoveryPolicy(),
> "corrupt replica recovery for two corrupt replicas", always fails, half the time with a checksum
> error upon block replication, and half the time by timing out upon failure to delete the second
> corrupt replica.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
