hadoop-hdfs-issues mailing list archives

From "Yiqun Lin (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-11353) Improve the unit tests relevant to DataNode volume failure testing
Date Sat, 28 Jan 2017 03:13:25 GMT

     [ https://issues.apache.org/jira/browse/HDFS-11353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Yiqun Lin updated HDFS-11353:
    Attachment: HDFS-11353.005.patch

Thanks [~xiaochen] for taking a look at this and giving your comments. The comments seem
I have attached a new patch to address the comments. I also added the timeout {{@Rule}} to the class
{{TestDataNodeVolumeFailureToleration}}, since I found that it sometimes fails as well. I set
the timeout to {{120s}}, as you mentioned in HDFS-11372, which should be sufficient.
I looked at the recent Jenkins builds, and the relevant tests take only around 1 to 2 minutes.
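||Class||Duration||Fail||(diff)||Skip||(diff)||Pass||(diff)||Total||(diff)||
|TestDataNodeVolumeFailure|1 min 7 sec|0|-1|0| |10|+1|10| |
|TestDataNodeVolumeFailureReporting|1 min 35 sec|0| |0| |6|+6|6|+6|
|TestDataNodeVolumeFailureToleration|43 sec|0| |0| |4| |4| |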
If the tests still fail, we can easily catch it and file a new JIRA to track it.
Thanks for the review.

> Improve the unit tests relevant to DataNode volume failure testing
> ------------------------------------------------------------------
>                 Key: HDFS-11353
>                 URL: https://issues.apache.org/jira/browse/HDFS-11353
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 3.0.0-alpha2
>            Reporter: Yiqun Lin
>            Assignee: Yiqun Lin
>         Attachments: HDFS-11353.001.patch, HDFS-11353.002.patch, HDFS-11353.003.patch,
HDFS-11353.004.patch, HDFS-11353.005.patch
> Currently, many tests whose names start with {{TestDataNodeVolumeFailure*}} frequently
> time out or fail. I found one failing test in a recent Jenkins build. The stack trace:
> {code}
> org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting.testSuccessiveVolumeFailures
> java.util.concurrent.TimeoutException: Timed out waiting for DN to die
> 	at org.apache.hadoop.hdfs.DFSTestUtil.waitForDatanodeDeath(DFSTestUtil.java:702)
> 	at org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting.testSuccessiveVolumeFailures(TestDataNodeVolumeFailureReporting.java:208)
> {code}
> The related code:
> {code}
>     /*
>      * Now fail the 2nd volume on the 3rd datanode. All its volumes
>      * are now failed and so it should report two volume failures
>      * and that it's no longer up. Only wait for two replicas since
>      * we'll never get a third.
>      */
>     DataNodeTestUtils.injectDataDirFailure(dn3Vol2);
>     Path file3 = new Path("/test3");
>     DFSTestUtil.createFile(fs, file3, 1024, (short)3, 1L);
>     DFSTestUtil.waitReplication(fs, file3, (short)2);
>     // The DN should consider itself dead
>     DFSTestUtil.waitForDatanodeDeath(dns.get(2));
> {code}
> Here the code waits for all volumes of the datanode to fail and for the datanode to then
> become dead, but it timed out. It would be better to first check that all the volumes have
> failed and only then wait for the datanode to die.
> In addition, we can use the method {{checkDiskErrorSync}} to do the disk error check
> instead of creating files. In this JIRA, I would like to extract this logic and define it
> in {{DataNodeTestUtils}}, so that we can reuse the method for datanode volume failure testing
> in the future.
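
The ordering described above (confirm that all volumes have failed, then wait for the datanode to die) can be illustrated with a minimal, stdlib-only sketch. This is not the code from the patch: {{FakeDataNode}}, {{waitFor}} and {{waitForAllVolumesFailedThenDeath}} are hypothetical names invented for this illustration, and the fake node simply treats itself as dead once every volume has failed. The point is only that two separate waits produce two distinct timeout messages, so a failure shows which phase stalled.

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

public class VolumeFailureWaitSketch {

  /** Hypothetical stand-in for a datanode; NOT a Hadoop class. */
  static class FakeDataNode {
    private final int totalVolumes;
    private final AtomicInteger failedVolumes = new AtomicInteger();

    FakeDataNode(int totalVolumes) {
      this.totalVolumes = totalVolumes;
    }

    void failVolume() {
      failedVolumes.incrementAndGet();
    }

    int getNumFailedVolumes() {
      return failedVolumes.get();
    }

    // Assumption of this sketch: the node is dead once every volume has failed.
    boolean isDead() {
      return failedVolumes.get() >= totalVolumes;
    }
  }

  /** Polls the condition until it holds, or throws once timeoutMs has elapsed. */
  static void waitFor(BooleanSupplier check, long intervalMs, long timeoutMs, String what)
      throws TimeoutException, InterruptedException {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (!check.getAsBoolean()) {
      if (System.nanoTime() >= deadline) {
        throw new TimeoutException("Timed out waiting for " + what);
      }
      Thread.sleep(intervalMs);
    }
  }

  /** First wait for the expected number of failed volumes, then for the node to die. */
  static void waitForAllVolumesFailedThenDeath(
      FakeDataNode dn, int expectedFailures, long timeoutMs)
      throws TimeoutException, InterruptedException {
    waitFor(() -> dn.getNumFailedVolumes() == expectedFailures,
        10, timeoutMs, "all volumes to fail");
    waitFor(dn::isDead, 10, timeoutMs, "DN to die");
  }

  public static void main(String[] args) throws Exception {
    FakeDataNode dn = new FakeDataNode(2);
    // Fail both volumes from another thread, as a failure injector would.
    Thread injector = new Thread(() -> {
      dn.failVolume();
      dn.failVolume();
    });
    injector.start();
    waitForAllVolumesFailedThenDeath(dn, 2, 2000);
    injector.join();
    System.out.println("failed volumes = " + dn.getNumFailedVolumes()
        + ", dead = " + dn.isDead());
  }
}
```

If only one of the two volumes fails, the first wait times out with a message naming the volume check rather than the generic "waiting for DN to die", which is the diagnostic benefit of splitting the two conditions.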

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org
