hadoop-hdfs-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6015) Fix TestBlockRecovery#testRaceBetweenReplicaRecoveryAndFinalizeBlock
Date Tue, 25 Feb 2014 17:24:28 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911749#comment-13911749 ]

Kihwal Lee commented on HDFS-6015:

Before HDFS-5583, the interrupted flag was not cleared before join() was called, so join() always threw
InterruptedException right away and never actually waited.  I noticed threads terminating
unexpectedly early and found the uncleared flag to be the cause.
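
The effect of the uncleared flag can be reproduced outside HDFS. The following is a minimal sketch, not the datanode code: the class and method names are hypothetical, and the worker thread is only a stand-in for the responder. It shows that join() throws at once while the caller's interrupted flag is still set, and waits normally once the flag has been cleared with Thread.interrupted().

```java
public class InterruptFlagSketch {

    /**
     * Joins on a short-lived worker while the calling thread may have a
     * pending interrupt. Returns true if join() threw InterruptedException
     * instead of waiting. (Hypothetical helper, for illustration only.)
     */
    static boolean joinThrows(boolean clearFlagFirst) {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(200); // stand-in for the responder's remaining work
            } catch (InterruptedException ignored) {
                // not expected in this sketch
            }
        });
        worker.start();

        // Simulate the receiver having been interrupted earlier.
        Thread.currentThread().interrupt();

        if (clearFlagFirst) {
            // Thread.interrupted() returns the flag and clears it.
            Thread.interrupted();
        }

        try {
            worker.join(5000);
            return false; // join() waited for the worker
        } catch (InterruptedException e) {
            return true;  // join() gave up immediately; the flag is now cleared
        }
    }

    public static void main(String[] args) {
        System.out.println("flag left set, join threw: " + joinThrows(false));
        System.out.println("flag cleared,  join threw: " + joinThrows(true));
    }
}
```

With the flag left set, the caller returns while the worker is still running, which matches the early termination described above.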

There are two flaws.

1) In the failing test case, the responder thread is blocked on a synchronized method because
the test calls another synchronized method first.  Since a thread waiting to enter a
{{synchronized}} method cannot be interrupted, the responder does not terminate. Before the
uncleared flag issue was fixed, the receiver would blow up right away and the synchronized
method called by the test case would return. The blocked responder was not in the critical
path, since join() never actually waited.  The responder eventually unblocked and terminated
on its own.

To be correct, the test should either use a test timeout longer than the join timeout
("dfs.datanode.xceiver.stop.timeout.millis") or set a shorter join timeout.

2) stopWriter() has the same join() timeout as the one used for the receiver joining on the
responder. This means that when the join() on the responder times out, stopWriter() will
likely time out as well.  A shorter timeout should be used when joining on the responder.
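
Both flaws can be sketched in a few lines of plain Java. This is an illustration under assumptions, not the datanode implementation: the class, the lock object and the thread bodies are hypothetical stand-ins. The caller holds a monitor the responder needs (as the test does), the receiver interrupts the responder and joins on it with an inner timeout, and the caller joins on the receiver with an outer timeout.

```java
import java.util.concurrent.CountDownLatch;

public class JoinTimeoutSketch {

    /**
     * Returns true if the receiver terminated within outerTimeoutMs.
     * (Hypothetical stand-in for stopWriter(); names are illustrative.)
     */
    static boolean stopWriter(long innerTimeoutMs, long outerTimeoutMs)
            throws InterruptedException {
        final Object lock = new Object();
        final CountDownLatch responderRunning = new CountDownLatch(1);

        Thread responder = new Thread(() -> {
            responderRunning.countDown();
            synchronized (lock) {
                // Reached only after the caller releases the monitor.
            }
        });
        responder.setDaemon(true);

        Thread receiver = new Thread(() -> {
            try {
                Thread.sleep(60_000); // stand-in for receiving packets
            } catch (InterruptedException e) {
                // The exception clears the flag, so the join below really waits.
            }
            responder.interrupt(); // no effect on a thread waiting for a monitor
            try {
                responder.join(innerTimeoutMs);
            } catch (InterruptedException ignored) {
                // not expected in this sketch
            }
        });
        receiver.setDaemon(true);

        synchronized (lock) { // the caller holds the monitor, like the test
            responder.start();
            responderRunning.await();
            receiver.start();
            receiver.interrupt();
            receiver.join(outerTimeoutMs);
            return !receiver.isAlive();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("inner 100 ms, outer 1000 ms: " + stopWriter(100, 1000));
        System.out.println("inner 1000 ms, outer 200 ms: " + stopWriter(1000, 200));
    }
}
```

When the inner timeout is clearly shorter than the outer one, the receiver gives up on the responder in time and the outer join succeeds. When the two are equal, the outcome depends on scheduling, which is why the outer join is only "likely" to time out.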

> Fix TestBlockRecovery#testRaceBetweenReplicaRecoveryAndFinalizeBlock
> --------------------------------------------------------------------
>                 Key: HDFS-6015
>                 URL: https://issues.apache.org/jira/browse/HDFS-6015
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, ha, hdfs-client, namenode
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
> After HDFS-5583, TestBlockRecovery.testRaceBetweenReplicaRecoveryAndFinalizeBlock started
> failing. It seems HDFS-5583 exposed a bug.
> When a receiver thread is interrupted, it is supposed to interrupt responder and join
> on it. The join timeout is configurable.  This is not what actually happens. It was fixed
> in HDFS-5583 and now the test case that depended on the broken behavior is breaking.
