hadoop-hdfs-issues mailing list archives

From Íñigo Goiri (JIRA) <j...@apache.org>
Subject [jira] [Commented] (HDFS-11398) TestDataNodeVolumeFailure#testUnderReplicationAfterVolFailure still fails intermittently
Date Wed, 30 May 2018 00:26:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-11398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494522#comment-16494522
] 

Íñigo Goiri commented on HDFS-11398:
------------------------------------

This test now fails fairly consistently in trunk.
On Windows it fails almost 100% of the time.

> TestDataNodeVolumeFailure#testUnderReplicationAfterVolFailure still fails intermittently
> ----------------------------------------------------------------------------------------
>
>                 Key: HDFS-11398
>                 URL: https://issues.apache.org/jira/browse/HDFS-11398
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.0.0-alpha2
>            Reporter: Yiqun Lin
>            Assignee: Yiqun Lin
>            Priority: Major
>         Attachments: HDFS-11398-reproduce.patch, HDFS-11398.001.patch, failure.log
>
>
> The test {{TestDataNodeVolumeFailure#testUnderReplicationAfterVolFailure}} still fails
intermittently in trunk after HDFS-11316. The stack trace:
> {code}
> testUnderReplicationAfterVolFailure(org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure)
 Time elapsed: 95.021 sec  <<< ERROR!
> java.util.concurrent.TimeoutException: Timed out waiting for condition. Thread diagnostics:
> Timestamp: 2017-02-07 07:00:34,193
> ....
> java.lang.Thread.State: RUNNABLE
>         at org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(Native Method)
>         at org.apache.hadoop.net.unix.DomainSocketWatcher.access$900(DomainSocketWatcher.java:52)
>         at org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:511)
>         at java.lang.Thread.run(Thread.java:745)
> 	at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:276)
> 	at org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure.testUnderReplicationAfterVolFailure(TestDataNodeVolumeFailure.java:412)
> {code}
> I looked into this and found a case in which the value of {{UnderReplicatedBlocksCount}}
is no longer > 0 by the time the test checks it. My analysis follows.
> In {{TestDataNodeVolumeFailure.testUnderReplicationAfterVolFailure}}, the test creates
a file to trigger disk error checking. The relevant code:
> {code}
>     Path file1 = new Path("/test1");
>     DFSTestUtil.createFile(fs, file1, 1024, (short)3, 1L);
>     DFSTestUtil.waitReplication(fs, file1, (short)3);
>     // Fail the first volume on both datanodes
>     File dn1Vol1 = new File(dataDir, "data"+(2*0+1));
>     File dn2Vol1 = new File(dataDir, "data"+(2*1+1));
>     DataNodeTestUtils.injectDataDirFailure(dn1Vol1, dn2Vol1);
>     Path file2 = new Path("/test2");
>     DFSTestUtil.createFile(fs, file2, 1024, (short)3, 1L);
>     DFSTestUtil.waitReplication(fs, file2, (short)3);
> {code}
> This leads to a problem: if the cluster is busy, waiting for file2 to reach the desired
replication can take a long time. During this time, the under-replicated blocks of file1 can
also be re-replicated in the cluster. If that happens, the condition {{underReplicatedBlocks >
0}} will never be satisfied.
> I can reproduce this in my local environment.
> Instead, we can use a simpler approach, {{DataNodeTestUtils.waitForDiskError}}, to replace this;
it runs faster and is more reliable.
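
The race described above can be sketched with a small deterministic simulation. This is only an illustration under stated assumptions, not the Hadoop test code: the class TransientConditionRace, the method conditionObserved, and the tick-based logical clock are hypothetical names introduced here. The model assumes file1 becomes under-replicated at tick 0 (the volume failure), that re-replication finishes at some later tick, and that the test only starts polling for "underReplicatedBlocks > 0" after the wait on file2 returns.

```java
// Hypothetical model of the race; none of these names exist in Hadoop.
class TransientConditionRace {

    /**
     * Returns true if a poller that starts at pollStartTick ever sees
     * underReplicated > 0 before giving up after timeoutTicks.
     * Blocks are under-replicated only while tick < repairDoneAtTick.
     */
    static boolean conditionObserved(int repairDoneAtTick,
                                     int pollStartTick,
                                     int timeoutTicks) {
        for (int tick = pollStartTick; tick < pollStartTick + timeoutTicks; tick++) {
            // The condition is transient: it holds only until re-replication completes.
            int underReplicated = tick < repairDoneAtTick ? 1 : 0;
            if (underReplicated > 0) {
                return true;
            }
        }
        // Corresponds to the TimeoutException raised by the real polling helper.
        return false;
    }

    public static void main(String[] args) {
        // Idle cluster: the wait on file2 returns quickly, polling starts before repair ends.
        System.out.println("idle cluster: observed=" + conditionObserved(10, 2, 50));
        // Busy cluster: the wait on file2 is slow, file1 is already repaired when polling starts.
        System.out.println("busy cluster: observed=" + conditionObserved(10, 30, 50));
    }
}
```

In this model, once polling starts after the repair has completed, no timeout length helps, which matches the observation that the condition "will never be satisfied".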



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

