Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hdfs-issues@hadoop.apache.org
Date: Wed, 19 Sep 2012 13:09:07 +1100 (NCT)
From: "Andy Isaacson (JIRA)" <jira@apache.org>
To: hdfs-issues@hadoop.apache.org
Message-ID: <958396434.95568.1348020547717.JavaMail.jiratomcat@arcas>
In-Reply-To: <1776147477.70934.1347475148022.JavaMail.jiratomcat@arcas>
Subject: [jira] [Updated] (HDFS-3931)
 TestDatanodeBlockScanner#testBlockCorruptionPolicy2 is broken
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/HDFS-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Isaacson updated HDFS-3931:
--------------------------------

    Attachment: hdfs3931-1.txt

Proposed fix (hackish), in three parts:
* Set DFS_NAMENODE_REPLICATION_PENDING_TIMEOUT_SEC to 5 seconds.
* Increase the delay in waitReplication so that pending replication timeouts have more than one chance to kick in.
* When attempting to corrupt blocks, retry if the block scanner beats us in the race.

In my testing with these changes, I had just one failure in 100 iterations.
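
For readers without the attachment handy, the retry idea in the third bullet can be sketched roughly as below. This is only an illustration and not the code in hdfs3931-1.txt: the class RetrySketch, the helper retry(), and the attempt/sleep values are all hypothetical, and the "corrupt the replica" step is replaced by a stand-in that fails twice before succeeding.

```java
import java.util.function.BooleanSupplier;

public class RetrySketch {
    /**
     * Hypothetical helper: calls attempt until it returns true or
     * maxAttempts is used up, sleeping sleepMillis between tries.
     * Returns the number of attempts used, or -1 if none succeeded.
     */
    public static int retry(BooleanSupplier attempt, int maxAttempts, long sleepMillis) {
        for (int i = 1; i <= maxAttempts; i++) {
            if (attempt.getAsBoolean()) {
                return i;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt and give up.
                Thread.currentThread().interrupt();
                return -1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Stand-in for "try to corrupt the block file": the first two
        // tries lose the race (return false), the third one wins.
        int[] calls = {0};
        int used = retry(() -> ++calls[0] >= 3, 5, 10);
        System.out.println("attempts used: " + used);
    }
}
```

In a real test the supplier would wrap whatever call corrupts the replica on disk and return false when the block scanner got there first.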
                
> TestDatanodeBlockScanner#testBlockCorruptionPolicy2 is broken
> -------------------------------------------------------------
>
>                 Key: HDFS-3931
>                 URL: https://issues.apache.org/jira/browse/HDFS-3931
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 2.0.0-alpha
>            Reporter: Eli Collins
>            Assignee: Andy Isaacson
>         Attachments: hdfs3931-1.txt, hdfs3931.txt
>
>
> Per Andy's comment on HDFS-3902:
> TestDatanodeBlockScanner still fails in about 1 of 5 runs, in testBlockCorruptionRecoveryPolicy2. That's due to a separate test issue also uncovered by HDFS-3828.
> The failure scenario for this one is a bit more tricky. I think I've captured the scenario below:
> - The test corrupts 2/3 replicas.
> - The client reports a bad block.
> - NN asks a DN to re-replicate, and randomly picks the other corrupt replica.
> - DN notices the incoming replica is corrupt and reports it as a bad block, but does not inform the NN that re-replication failed.
> - NN keeps the block on pendingReplications.
> - BP scanner wakes up on both DNs with corrupt blocks, and both report corruption. NN reports both as duplicates, one from the client and one from the DN report above.
> - Since the block is on pendingReplications, NN does not schedule another replication.
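
The last step of the scenario above (a block that sits on pendingReplications blocks any new replication until its entry times out) can be illustrated with a toy model. This is a deliberately simplified sketch and not the NameNode's implementation: the class PendingReplicationSketch, the method maybeSchedule(), and the block ID are hypothetical, and time is passed in explicitly rather than read from a clock.

```java
import java.util.HashMap;
import java.util.Map;

public class PendingReplicationSketch {
    // Block ID -> time (ms) at which its replication was scheduled.
    private final Map<String, Long> pending = new HashMap<>();
    private final long timeoutMillis;

    public PendingReplicationSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /**
     * Schedules a replication for blockId unless an earlier one is still
     * pending and has not timed out. Returns true if one was scheduled.
     */
    public boolean maybeSchedule(String blockId, long nowMillis) {
        Long since = pending.get(blockId);
        if (since != null && nowMillis - since < timeoutMillis) {
            return false; // still pending: nothing new is scheduled
        }
        pending.put(blockId, nowMillis);
        return true;
    }

    public static void main(String[] args) {
        PendingReplicationSketch nn = new PendingReplicationSketch(5_000);
        System.out.println(nn.maybeSchedule("blk_1", 0));     // first request
        System.out.println(nn.maybeSchedule("blk_1", 1_000)); // inside the timeout
        System.out.println(nn.maybeSchedule("blk_1", 5_000)); // timeout elapsed
    }
}
```

In this model the second request is refused because the failed replication was never cleared, which is the stuck state described above. A shorter pending timeout narrows the window during which the test has to wait for a retry.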

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira