hadoop-hdfs-issues mailing list archives

From "Yiqun Lin (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-11398) TestDataNodeVolumeFailure#testUnderReplicationAfterVolFailure still fails intermittently
Date Wed, 08 Feb 2017 10:11:41 GMT
Yiqun Lin created HDFS-11398:

             Summary: TestDataNodeVolumeFailure#testUnderReplicationAfterVolFailure still
fails intermittently
                 Key: HDFS-11398
                 URL: https://issues.apache.org/jira/browse/HDFS-11398
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 3.0.0-alpha2
            Reporter: Yiqun Lin
            Assignee: Yiqun Lin

The test {{TestDataNodeVolumeFailure#testUnderReplicationAfterVolFailure}} still fails intermittently
in trunk after HDFS-11316. The stack trace:
java.util.concurrent.TimeoutException: Timed out waiting for DN to die
	at org.apache.hadoop.hdfs.DFSTestUtil.waitForDatanodeDeath(DFSTestUtil.java:702)
	at org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting.testSuccessiveVolumeFailures(TestDataNodeVolumeFailureReporting.java:218)

I looked into this and found there is a chance that the value {{UnderReplicatedBlocksCount}}
will no longer be > 0. The following is my analysis:
In the test {{TestDataNodeVolumeFailure.testUnderReplicationAfterVolFailure}}, a file is created
to trigger the disk error checking. The related code:
    Path file1 = new Path("/test1");
    DFSTestUtil.createFile(fs, file1, 1024, (short)3, 1L);
    DFSTestUtil.waitReplication(fs, file1, (short)3);

    // Fail the first volume on both datanodes
    File dn1Vol1 = new File(dataDir, "data"+(2*0+1));
    File dn2Vol1 = new File(dataDir, "data"+(2*1+1));

    DataNodeTestUtils.injectDataDirFailure(dn1Vol1, dn2Vol1);
    Path file2 = new Path("/test2");
    DFSTestUtil.createFile(fs, file2, 1024, (short)3, 1L);
    DFSTestUtil.waitReplication(fs, file2, (short)3);
This leads to one problem: if the cluster is busy, it can take a long time for the replication
of file2 to reach the desired value. In the meantime, there is a chance that the under-replicated
blocks of file1 are also re-replicated in the cluster. If that happens, the condition
{{underReplicatedBlocks > 0}} will never be satisfied.
This can be reproduced in my local env.
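The race above can be sketched with a minimal, self-contained Java analogue (all names here are hypothetical stand-ins for illustration, not actual Hadoop code): a background thread plays the role of the replication monitor and drains the under-replicated count while the test is still blocked waiting on file2's replication.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class UnderReplicationRace {
    // Returns the under-replicated count the test would observe after its
    // long wait on file2. All names are hypothetical, for illustration only.
    static int simulateRace(int initiallyUnderReplicated) throws InterruptedException {
        AtomicInteger underReplicatedBlocks = new AtomicInteger(initiallyUnderReplicated);

        // Background "replication monitor": re-replicates file1's blocks
        // while the test is still blocked in waitReplication(fs, file2, ...).
        Thread replicationMonitor = new Thread(() -> {
            while (underReplicatedBlocks.get() > 0) {
                underReplicatedBlocks.decrementAndGet(); // one block re-replicated
            }
        });
        replicationMonitor.start();

        // The test's wait on file2 outlasts the monitor on a busy cluster,
        // so by the time it checks the count, the count is already drained.
        replicationMonitor.join();
        return underReplicatedBlocks.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // The assertion "underReplicatedBlocks > 0" can no longer hold.
        System.out.println("observed count: " + simulateRace(2));
    }
}
```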

Actually, we can use an easier way, {{DataNodeTestUtils.waitForDiskError}}, to replace this; it
runs faster and is more reliable.
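The reason a direct wait is more reliable is that it polls the exact condition of interest with a bounded timeout, instead of waiting on an unrelated replication side effect. A minimal, self-contained sketch of that polling pattern (a re-implementation for illustration, not the actual Hadoop utility code):

```java
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

public class WaitForSketch {
    // Poll `check` every `intervalMs` until it is true, or fail with a
    // TimeoutException after `timeoutMs`. This is the style of bounded wait
    // that test helpers such as DataNodeTestUtils.waitForDiskError build on.
    static void waitFor(BooleanSupplier check, long intervalMs, long timeoutMs)
            throws TimeoutException, InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!check.getAsBoolean()) {
            if (System.currentTimeMillis() > deadline) {
                throw new TimeoutException("Timed out waiting for condition");
            }
            Thread.sleep(intervalMs);
        }
    }

    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        // Hypothetical condition: stands in for "the DataNode has noticed
        // the failed volume"; becomes true after ~50 ms here.
        waitFor(() -> System.currentTimeMillis() - start > 50, 10, 1000);
        System.out.println("disk error detected");
    }
}
```

Because the wait returns as soon as the condition holds (rather than after a fixed sleep or an unrelated event), the test both runs faster on healthy runs and fails with a clear timeout when the condition genuinely never occurs.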

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org
