hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-3391) TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing
Date Fri, 11 May 2012 06:24:44 GMT

     [ https://issues.apache.org/jira/browse/HDFS-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HDFS-3391:

    Attachment: hdfs-3391.txt

The attached patch seems to fix the issue, even with HDFS-3157 and the troublesome
sleep() call mentioned above in place.

I think what was happening here was the following:

- in some cases, the block synchronization path can run twice, if the first attempt is slow.
This ends up first finalizing the block at genstamp 1005, and then again at 1006 or 1007.
- for each of those genstamps, the DNs report FINALIZED replicas to both NNs.
- When the new NN becomes active, it then replays the block reports -- first FINALIZED for
blk_N_1005, and then FINALIZED for blk_N_1006.
- When it sees the blk_N_1005 genstamp, it already knows that 1006 is the "correct" latest
genstamp for the block, so it wants to mark it as corrupt.
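The genstamp comparison in the last step can be sketched as follows. This is a hypothetical simplification for illustration, not the actual NameNode code; the class and method names are made up:

```java
// Hypothetical sketch: when replaying block reports, a replica whose
// generation stamp is behind the recorded latest genstamp for the block
// is treated as corrupt. Names are illustrative, not the real API.
public class GenstampCheckSketch {
    static String classify(long reportedGenStamp, long latestGenStamp) {
        return reportedGenStamp < latestGenStamp ? "CORRUPT" : "VALID";
    }

    public static void main(String[] args) {
        long latest = 1006;                         // NN already knows 1006 is current
        System.out.println(classify(1005, latest)); // stale replica -> CORRUPT
        System.out.println(classify(1006, latest)); // current replica -> VALID
    }
}
```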

Here is where the behavior differs:

Prior to HDFS-3157, it was marking blk_N_1006 as corrupt instead of blk_N_1005. Thus the markBlockAsCorrupt()
call would succeed. When processing the FINALIZED blk_N_1006, it would remove it from the
corrupt list, and everything would be fine.

With HDFS-3157 in place, it instead marks blk_N_1005 as corrupt. However, the BlockInfo object
it creates to do so has no attached inode (BlockCollection in new parlance). So, markBlockAsCorrupt
immediately enqueues the replica for invalidation, rather than treating it like a normal corrupt
replica. Then, upon seeing the report of the blk_N_1006 FINALIZED replica, the check against
invalidateBlocks.contains(block) causes it to be skipped, and thus addStoredBlock() never
gets called.
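The buggy skip boils down to testing invalidation-queue membership by block ID alone. A minimal sketch of that behavior, with made-up names (the real invalidation queue in BlockManager is a richer structure keyed per-datanode):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the buggy path: membership in the invalidation
// queue is tested by block ID only, so the later blk_42_1006 FINALIZED
// report is skipped and addStoredBlock() never runs.
public class SkippedReportSketch {
    static boolean shouldSkip(Set<Long> invalidateBlocks, long reportedBlockId) {
        return invalidateBlocks.contains(reportedBlockId);
    }

    public static void main(String[] args) {
        Set<Long> invalidateBlocks = new HashSet<>();
        invalidateBlocks.add(42L);  // blk_42_1005 enqueued for invalidation

        // FINALIZED report arrives for blk_42_1006: same ID, newer genstamp.
        if (shouldSkip(invalidateBlocks, 42L)) {
            System.out.println("skipped"); // bug: the newer replica is dropped
        } else {
            System.out.println("stored");  // addStoredBlock() would run
        }
    }
}
```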

The fix in this patch is to change invalidateBlocks so that its contains() call can check
for genstamp match as well. So, even though blk_N_1005 has been enqueued for deletion, we
should still accept a block report for blk_N_1006.
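The idea behind the fix can be sketched like this. Again a hypothetical simplification with invented names, not the actual InvalidateBlocks implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the fix: contains() requires a genstamp match,
// not just a block-ID match, so a report for blk_42_1006 is accepted even
// though blk_42_1005 sits in the invalidation queue.
class Block {
    final long id;
    final long genStamp;
    Block(long id, long genStamp) { this.id = id; this.genStamp = genStamp; }
}

class InvalidateSet {
    // blocks queued for deletion, keyed by block ID
    private final Map<Long, Block> queued = new HashMap<>();

    void add(Block b) { queued.put(b.id, b); }

    // Before the fix: membership by block ID alone.
    boolean containsById(Block b) {
        return queued.containsKey(b.id);
    }

    // After the fix: the generation stamps must match too.
    boolean contains(Block b) {
        Block q = queued.get(b.id);
        return q != null && q.genStamp == b.genStamp;
    }
}

public class GenstampContainsSketch {
    public static void main(String[] args) {
        InvalidateSet inval = new InvalidateSet();
        inval.add(new Block(42, 1005));      // stale replica queued for deletion

        Block report = new Block(42, 1006);  // later FINALIZED report
        System.out.println(inval.containsById(report)); // true  -> report skipped (bug)
        System.out.println(inval.contains(report));     // false -> addStoredBlock runs (fix)
    }
}
```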
> TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing
> ---------------------------------------------------------------
>                 Key: HDFS-3391
>                 URL: https://issues.apache.org/jira/browse/HDFS-3391
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Arun C Murthy
>            Assignee: Todd Lipcon
>            Priority: Critical
>         Attachments: hdfs-3391.txt
> Running org.apache.hadoop.hdfs.server.blockmanagement.TestRBWBlockInvalidation
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 7.208 sec <<<
> --
> Running org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover
> Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 81.195 sec <<<
> --


