hadoop-hdfs-issues mailing list archives

From "Walter Su (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9243) TestUnderReplicatedBlocks#testSetrepIncWithUnderReplicatedBlocks test timeout
Date Fri, 23 Oct 2015 02:48:27 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970318#comment-14970318
] 

Walter Su commented on HDFS-9243:
---------------------------------

The reason is that ReplicationMonitor chose the same DataNode as the target twice for the same block.

totalNodes=\{DN0, DN1, DN2\}
One block has replFactor=3 and liveNodes=\{DN0\}, so the block is under-replicated.
For some reason, DN2 is not chosen by ReplicationMonitor.
ReplicationMonitor chose DN1 as the target and scheduled the 1st recovery. pendingNum=1.
Later, before DN1 reported, ReplicationMonitor found liveNodes + pendingNum < replFactor,
so it chose DN1 as the target again and scheduled a 2nd recovery. pendingNum=2.

DN1 can't hold 2 identical replicas, so only one replica is recovered at DN1. But liveNodes +
pendingNum = replFactor, so the block waits in the {{PendingReplicationBlocks}} map until the
pending entry times out.
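
The accounting above can be sketched as a toy model. This is only an illustration of the scenario, not Hadoop code: the class {{PendingReplicationSketch}} and its methods ({{scheduleIfNeeded}}, {{reportReplica}}, {{isStuck}}, {{runScenario}}) are hypothetical names, and the assumption that one DataNode report clears exactly one pending entry is mine.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: none of these names are Hadoop classes or methods.
public class PendingReplicationSketch {
    static final int REPL_FACTOR = 3;

    // DataNodes known to hold a finalized replica.
    final Set<String> liveNodes = new HashSet<>();
    // One entry per scheduled recovery; the same target may appear twice.
    final List<String> pendingTargets = new ArrayList<>();

    // One monitor pass: schedule a recovery only if liveNodes + pendingNum < replFactor.
    boolean scheduleIfNeeded(String target) {
        if (liveNodes.size() + pendingTargets.size() < REPL_FACTOR) {
            pendingTargets.add(target);
            return true;
        }
        return false;
    }

    // A DataNode reports its replica: it becomes live and one pending entry is cleared
    // (assumption of this sketch).
    void reportReplica(String node) {
        liveNodes.add(node);
        pendingTargets.remove(node); // List.remove(Object) drops only the first match
    }

    // Under-replicated, yet the monitor sees nothing left to schedule.
    boolean isStuck() {
        return liveNodes.size() < REPL_FACTOR
            && liveNodes.size() + pendingTargets.size() >= REPL_FACTOR;
    }

    // Replays the scenario: liveNodes={DN0}, then DN1 is chosen as target twice.
    static PendingReplicationSketch runScenario() {
        PendingReplicationSketch s = new PendingReplicationSketch();
        s.liveNodes.add("DN0");
        s.scheduleIfNeeded("DN1"); // 1st recovery, pendingNum=1
        s.scheduleIfNeeded("DN1"); // 2nd recovery to the same target, pendingNum=2
        s.reportReplica("DN1");    // only one replica can exist on DN1
        return s;
    }

    public static void main(String[] args) {
        PendingReplicationSketch s = runScenario();
        System.out.println("liveNodes=" + s.liveNodes.size()
            + " pendingNum=" + s.pendingTargets.size()
            + " stuck=" + s.isStuck());
    }
}
```

In this model the leftover pending entry for DN1 keeps liveNodes + pendingNum at replFactor, so a later call such as {{scheduleIfNeeded("DN2")}} schedules nothing until that entry is dropped.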

Some test cases wait for liveNodes to reach replFactor, but this test case couldn't wait 5 min,
so it failed. Because of the randomness of BlockPlacementPolicy and the relatively large number
of DNs, it may not be a problem in production.

I checked the [testReport|https://builds.apache.org/job/PreCommit-HDFS-Build/13124/testReport/org.apache.hadoop.hdfs.server.blockmanagement/TestUnderReplicatedBlocks/testSetrepIncWithUnderReplicatedBlocks/] and saw:
{noformat}
2015-10-22 15:10:18,400 [DataXceiver for client  at /127.0.0.1:47728 [Receiving block BP-57081724-67.195.81.153-1445526607531:blk_1073741825_1001]]
INFO  datanode.DataNode (DataXceiver.java:run(270)) - 127.0.0.1:47391:DataXceiver error processing
WRITE_BLOCK operation  src: /127.0.0.1:47728 dst: /127.0.0.1:47391; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException:
Block BP-57081724-67.195.81.153-1445526607531:blk_1073741825_1001 already exists in state
FINALIZED and thus cannot be created.
{noformat}
I've seen this before in HDFS-9275.

> TestUnderReplicatedBlocks#testSetrepIncWithUnderReplicatedBlocks test timeout
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-9243
>                 URL: https://issues.apache.org/jira/browse/HDFS-9243
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: HDFS
>            Reporter: Wei-Chiu Chuang
>            Priority: Minor
>
> org.apache.hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks sometimes times out.
> This is happening on trunk, as can be observed in several recent Jenkins jobs
> (e.g. https://builds.apache.org/job/Hadoop-Hdfs-trunk/2423/ https://builds.apache.org/job/Hadoop-Hdfs-trunk/2386/
> https://builds.apache.org/job/Hadoop-Hdfs-trunk/2351/ https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/472/).
> On my local Linux machine, this test case times out 6 out of 10 times. When it does not
> time out, the test takes about 20 seconds; otherwise it takes more than 60 seconds and then
> times out.
> I suspect it's a deadlock issue, as a deadlock occurred in this test case before, in HDFS-5527.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
