hadoop-hdfs-issues mailing list archives

From "Matt Foley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-1562) Add rack policy tests
Date Tue, 12 Apr 2011 21:13:05 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019049#comment-13019049 ]

Matt Foley commented on HDFS-1562:

Hi Eli, thanks for pointing out the relationship between these two bugs.  If you wish, since
you are effectively rewriting the whole unit test file, I'll withdraw my patch except for
a one-line fix which should not interfere with auto-merge when you submit.

That said, I think your patch may be subject to the same problem I fixed in HDFS-1828:

The primary problem in HDFS-1828 was that testSufficientlyReplicatedBlocksWithNotEnoughRacks()
waited "while ((numRacks < 2) || (curReplicas != REPLICATION_FACTOR) || (neededReplicationSize
> 0))" [line 79], and then asserted "(curReplicas == REPLICATION_FACTOR)" [line 95]; when
in fact, under the circumstances of the test, it was appropriate to expect curReplicas ==
REPLICATION_FACTOR+1.
It looks like the same issue remains in your patch: in waitForReplication(), it waits
"while ((curRacks < racks || curReplicas < replicas || curNeededReplicas >
neededReplicas) && count < 10)", and then does "assertEquals(replicas, curReplicas)".
So it will have the same problem, unless you never use it in a context where curReplicas
> replicas can occur.
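The difference between the two wait conditions can be shown with a small, self-contained sketch (plain Java; the names waitLessThan/waitNotEqual and the replayed replica counts are mine for illustration, not code from either patch):

```java
import java.util.function.IntSupplier;

public class WaitSemantics {

    // Replays a fixed timeline of replica counts, one value per poll
    // (sticking on the last value). Stands in for querying the namenode.
    static IntSupplier replay(int[] timeline) {
        int[] i = {0};
        return () -> timeline[Math.min(i[0]++, timeline.length - 1)];
    }

    // Wait in the style of the patch: stop as soon as curReplicas >= expected.
    // An over-replicated block satisfies this immediately, so the caller's
    // subsequent assertEquals(expected, curReplicas) can fail.
    static int waitLessThan(IntSupplier replicas, int expected, int maxPolls) {
        int cur = replicas.getAsInt();
        for (int polls = 0; cur < expected && polls < maxPolls; polls++) {
            cur = replicas.getAsInt();
        }
        return cur;
    }

    // Wait in the style of DFSTestUtil.waitReplication: keep polling while
    // the count differs from the expected value in either direction.
    static int waitNotEqual(IntSupplier replicas, int expected, int maxPolls) {
        int cur = replicas.getAsInt();
        for (int polls = 0; cur != expected && polls < maxPolls; polls++) {
            cur = replicas.getAsInt();
        }
        return cur;
    }

    public static void main(String[] args) {
        // A block that is momentarily over-replicated (3 wanted, 4 live)
        // and then trimmed back to 3 by the replication monitor.
        int[] timeline = {4, 4, 3};
        System.out.println("< wait returned  " + waitLessThan(replay(timeline), 3, 10));
        System.out.println("!= wait returned " + waitNotEqual(replay(timeline), 3, 10));
    }
}
```

With the "<" condition the wait returns 4 on the very first poll and an equality assert would fail; the "!=" condition keeps waiting until the excess replica is gone.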

A couple additional suggestions:

1. You added a waitForReplication() method.  Can you instead use DFSTestUtil.waitReplication()?
 (And BTW, that method correctly checks for replication being != the expected value rather
than < it.)  Or, if you need the block-oriented signature of your version, could you consider
adding it to DFSTestUtil instead of leaving it in just the one unit test module?

2. I'm concerned about waitForCorruptReplicas(), because it is polling for a problematic condition
that is supposed to be self-healing, and uses a fairly coarse poll frequency (a whole second).
 It is possible for such a test to "miss" the condition it is trying to catch.  See HDFS-1806,
where I just fixed such a problem by changing a polling frequency from 100ms to 5ms.  
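The sampling argument can be made concrete with a deterministic sketch (plain Java; the 20-tick visibility window and the interval values are illustrative, not measured from HDFS):

```java
public class PollGranularity {

    // A transient "corrupt replica visible" condition that self-heals:
    // on a 1ms-per-tick clock it is observable for only 20 ticks.
    static boolean conditionAt(int tick) {
        return tick >= 103 && tick < 123;
    }

    // Poll every `interval` ticks up to `horizon`; report whether the
    // condition was ever observed at a sample point.
    static boolean observed(int interval, int horizon) {
        for (int tick = 0; tick <= horizon; tick += interval) {
            if (conditionAt(tick)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Coarse polls step right over the window; a fine poll lands in it.
        System.out.println("1000ms poll saw it: " + observed(1000, 10000));
        System.out.println("100ms poll saw it:  " + observed(100, 10000));
        System.out.println("5ms poll saw it:    " + observed(5, 10000));
    }
}
```

Here both the one-second and the 100ms polls sample at ticks 100 and 200 and never see the condition; the 5ms poll samples tick 105 and catches it, which is the shape of the HDFS-1806 fix.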

Now, I haven't had time to fully understand the tests in your new version.  It may be that
you are controlling for other parameters, such as the values of DFS_HEARTBEAT_INTERVAL, DFS_BLOCKREPORT_INTERVAL,
and DFS_NAMENODE_REPLICATION_INTERVAL, that would prevent the condition from self-healing
in the time period over which you are waiting for it.  But I have seen corrupt replicas be
recognized and eliminated in less than a second on a tiny cluster, given the right intersection
of events.  Since such issues become long-lived intermittent false positives for lots of people
on Hudson :-), I hope you don't mind my asking you to reason through why this
construct can't miss its condition.  Thanks.
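For reference, a test can pin those knobs when it constructs the mini cluster. A hedged fragment (key constants from DFSConfigKeys; the values are purely illustrative, and this assumes a Hadoop test context rather than being standalone code):

```java
// Fragment only: slow the self-healing machinery down so a transient
// condition stays observable while the test polls for it.
Configuration conf = new HdfsConfiguration();
conf.setLong(DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY, 1L);            // seconds
conf.setLong(DFSConfigKeys.DFS_BLOCKREPORT_INTERVAL_MSEC_KEY, 600000L);
conf.setInt(DFSConfigKeys.DFS_NAMENODE_REPLICATION_INTERVAL_KEY, 600); // seconds
```

With the block report and replication monitor intervals pushed far past the test's wait budget, the condition cannot self-heal underneath the poll loop.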

> Add rack policy tests
> ---------------------
>                 Key: HDFS-1562
>                 URL: https://issues.apache.org/jira/browse/HDFS-1562
>             Project: Hadoop HDFS
>          Issue Type: Test
>          Components: name-node, test
>    Affects Versions: 0.23.0
>            Reporter: Eli Collins
>            Assignee: Eli Collins
>         Attachments: hdfs-1562-1.patch, hdfs-1562-2.patch
> The existing replication tests (TestBlocksWithNotEnoughRacks, TestPendingReplication,
TestOverReplicatedBlocks, TestReplicationPolicy, TestUnderReplicatedBlocks, and TestReplication)
are missing tests for rack policy violations.  This jira adds the following tests which I
created when generating a new patch for HDFS-15.
> * Test that blocks that have a sufficient number of total replicas, but are not replicated
cross rack, get replicated cross rack when a rack becomes available.
> * Test that new blocks for an underreplicated file will get replicated cross rack. 
> * Mark a block as corrupt, test that when it is re-replicated that it is still replicated
across racks.
> * Reduce the replication factor of a file, making sure that the only block that is across
racks is not removed when deleting replicas.
> * Test that when a block is replicated because a replica is lost due to host failure,
the rack policy is preserved.
> * Test that when the excess replicas of a block are reduced due to a node re-joining
the cluster the rack policy is not violated.
> * Test that rack policy is still respected when blocks are replicated due to node decommissioning.
> * Test that rack policy is still respected when blocks are replicated due to node decommissioning,
even when the blocks are over-replicated.

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
