hadoop-hdfs-issues mailing list archives

From "Jing Zhao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-8729) Fix testTruncateWithDataNodesRestartImmediately occasionally failed
Date Wed, 08 Jul 2015 17:47:04 GMT

    [ https://issues.apache.org/jira/browse/HDFS-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619035#comment-14619035 ]

Jing Zhao commented on HDFS-8729:

Whether or not a block report is triggered before restarting the DataNodes may exercise different
code paths: if the DNs send their reports to the NN before restarting, it is quite possible that the
truncate completes before the restart. Otherwise, the recovery process may happen after the DNs
restart. In these two scenarios, the block replicas reported by the DNs and the block info stored
in the NN can be in different states when the restarted DNs send their first block reports to the NN.

In my test, the timeout appears to be caused by a race in the block recovery process:
the second DN sends its block report after the block truncation has finished, so its replica is
marked as corrupt. However, the replication monitor cannot schedule an extra replica because
there are only 3 DataNodes in the test. So a quick fix may be to change the total number
of DNs from 3 to 4. What do you think, Walter?
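
The scheduling constraint described above can be illustrated with a toy model. This is a sketch only, not HDFS code: the class `ReplicaRaceSketch` and the method `liveReplicasAfterRecovery` are hypothetical names, and the model assumes that a DataNode already holding a replica of the block (healthy or corrupt) cannot be chosen as the target for a new replica.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Toy model (hypothetical, not part of Hadoop) of why a 3-DataNode cluster
 * cannot get back to 3 healthy replicas once one replica is marked corrupt.
 */
public class ReplicaRaceSketch {

    /**
     * Returns the number of healthy replicas once the modelled replication
     * monitor has used every eligible target DataNode.
     *
     * Assumption: a DataNode that already holds a replica of the block,
     * corrupt or not, is not eligible as a target for a new replica.
     */
    static int liveReplicasAfterRecovery(int totalDataNodes,
                                         int replicationFactor,
                                         int corruptReplicas) {
        if (replicationFactor > totalDataNodes || corruptReplicas > replicationFactor) {
            throw new IllegalArgumentException("inconsistent cluster description");
        }
        // DataNodes 0..replicationFactor-1 initially hold one replica each.
        Set<Integer> holders = new HashSet<>();
        for (int dn = 0; dn < replicationFactor; dn++) {
            holders.add(dn);
        }
        int live = replicationFactor - corruptReplicas;
        // Schedule new replicas only on DataNodes that hold no replica yet.
        for (int dn = 0; dn < totalDataNodes && live < replicationFactor; dn++) {
            if (!holders.contains(dn)) {
                holders.add(dn);
                live++;
            }
        }
        return live;
    }

    public static void main(String[] args) {
        // Replication factor 3, one replica marked corrupt after the race.
        System.out.println("3 DataNodes: " + liveReplicasAfterRecovery(3, 3, 1) + " live replicas");
        System.out.println("4 DataNodes: " + liveReplicasAfterRecovery(4, 3, 1) + " live replicas");
    }
}
```

Under these assumptions the 3-DataNode case stays at 2 healthy replicas, so a wait for 3 replicas would time out, while a fourth DataNode provides a free target and the block returns to 3 healthy replicas.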

> Fix testTruncateWithDataNodesRestartImmediately occasionally failed
> -------------------------------------------------------------------
>                 Key: HDFS-8729
>                 URL: https://issues.apache.org/jira/browse/HDFS-8729
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Walter Su
>            Assignee: Walter Su
>            Priority: Minor
>         Attachments: HDFS-8729.01.patch
> https://builds.apache.org/job/PreCommit-HDFS-Build/11449/testReport/
> https://builds.apache.org/job/PreCommit-HDFS-Build/11593/testReport/
> https://builds.apache.org/job/PreCommit-HDFS-Build/11596/testReport/
> https://builds.apache.org/job/PreCommit-HDFS-Build/11599/testReport/
> {noformat}
> java.util.concurrent.TimeoutException: Timed out waiting for /test/testTruncateWithDataNodesRestartImmediately to reach 3 replicas
> 	at org.apache.hadoop.hdfs.DFSTestUtil.waitReplication(DFSTestUtil.java:761)
> 	at org.apache.hadoop.hdfs.server.namenode.TestFileTruncate.testTruncateWithDataNodesRestartImmediately(TestFileTruncate.java:814)
> {noformat}

This message was sent by Atlassian JIRA