hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Matt Foley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-1562) Add rack policy tests
Date Thu, 21 Apr 2011 18:51:05 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022868#comment-13022868

Matt Foley commented on HDFS-1562:

This is a fairly large patch, so I put it on ReviewBoard to see a clear diff: https://reviews.apache.org/r/645/

I really like the nice clean new TestBlocksWithNotEnoughRacks, using the methods Eli added
to DFSTestUtil, and I commend the work he put into refactoring those methods so they can be

Since the DFSTestUtil methods are intended for re-use, I went over them in some detail, and
have the following comments:

1. DFSTestUtil#readFile(FileSystem fs, Path fileName)
The previously existing readFile(File f) reads as much data as the file has.  This routine
only reads 1KB at most.  Should they both do the same, whichever is better?

2. isBlockCorrupt() will only return true if ALL replicas of the block are corrupt, so I suggest
naming it something like "isAllBlockReplicasCorrupt()"

3. I strongly feel that timeouts should (i) throw TimeoutException, and (ii) message ALL the
values being waited on, not just one or none.  The following "wait for condition" methods
don't follow these rules:

(a) waitForReplication(MiniDFSCluster cluster, Block b, int racks, int replicas, int neededReplicas)
The current code does a println that only says a timeout occurred, then falls into three asserts,
only one of which will trigger. Asserts imply a logic error, while TimeoutException implies
a timeout occurred, and the exception message should include the current values of all three
waited-on values, which are likely inter-related for debug purposes.

Also, "&& count < 10" should be "&& count < ATTEMPTS"

(b) waitCorruptReplicas() should use TimeoutException, and the error msg should say the failure
state of the value being waited on (repls).

(c) same for waitForDecommission()

4. In the above wait loops, 10 seconds is the timeout.  Is that long enough?  It is barely
three 3-sec cycles.  Would 20sec. be better?

5. corruptReplica() - The original code in TestDatanodeBlockScanner searched for and corrupted
all corresponding block files on a datanode, but returned binary true/false.  This proposed
code returns the actual number of such block files found.  That might not be a good idea,
as clients (such as corruptBlockOnDataNodes()) probably assume the sum over all datanodes
will equal the number of replicas.

Perhaps it would be best for corruptReplica() itself to throw an exception if count > 1,
rather than rely on every client to do so?  Then it could just return a boolean for the two
valid counts (0 and 1).  IIRC, every client except corruptBlockOnDataNodes() is in fact testing
that the result == 1.

6. doTestEntirelyCorruptFile() line 261 asserts "equals" but the error message says "too few
blocks", implying "less than".  One or the other should be changed to agree.

7. NameNodeAdapter#getReplicaInfo(NameNode namenode, Block b) is supposed to return info about
specific block b,
but the 3rd element of the tuple is "ns.blockManager.neededReplications.size()", which is
for all blocks.
Wouldn't it be appropriate to check ns.blockManager.neededReplications.contains(b), and if
so iterate to find the count?

Sorry this is long.  Overall it's a great piece of work.

> Add rack policy tests
> ---------------------
>                 Key: HDFS-1562
>                 URL: https://issues.apache.org/jira/browse/HDFS-1562
>             Project: Hadoop HDFS
>          Issue Type: Test
>          Components: name-node, test
>    Affects Versions: 0.23.0
>            Reporter: Eli Collins
>            Assignee: Eli Collins
>         Attachments: hdfs-1562-1.patch, hdfs-1562-2.patch, hdfs-1562-3.patch
> The existing replication tests (TestBlocksWithNotEnoughRacks, TestPendingReplication,
TestOverReplicatedBlocks, TestReplicationPolicy, TestUnderReplicatedBlocks, and TestReplication)
are missing tests for rack policy violations.  This jira adds the following tests which I
created when generating a new patch for HDFS-15.
> * Test that blocks that have a sufficient number of total replicas, but are not replicated
cross rack, get replicated cross rack when a rack becomes available.
> * Test that new blocks for an underreplicated file will get replicated cross rack. 
> * Mark a block as corrupt, test that when it is re-replicated that it is still replicated
across racks.
> * Reduce the replication factor of a file, making sure that the only block that is across
racks is not removed when deleting replicas.
> * Test that when a block is replicated because a replica is lost due to host failure
the the rack policy is preserved.
> * Test that when the execss replicas of a block are reduced due to a node re-joining
the cluster the rack policy is not violated.
> * Test that rack policy is still respected when blocks are replicated due to node decommissioning.
> * Test that rack policy is still respected when blocks are replicated due to node decommissioning,
even when the blocks are over-replicated.

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

View raw message