hadoop-hdfs-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
Date Wed, 04 Nov 2015 16:39:27 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989859#comment-14989859 ]

Kihwal Lee commented on HDFS-4937:
----------------------------------

First of all, the precommit build ran 4,075 test cases, so I think it ran all of them this
time.

The test failures are not related to the patch. I've rerun the failed tests, and only {{TestSeveralNameNodes}}
was failing occasionally. It was timing out while waiting for a thread to finish writing. This
test has been failing in other precommit builds as well. When I increased the timeout, it passed
every time. I will file a jira for this.

{panel}
-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed
in 8.0
Running org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes
Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 62.298 sec - in org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed
in 8.0
Running org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 12.295 sec - in org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed
in 8.0
Running org.apache.hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 157.484 sec - in org.apache.hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed
in 8.0
Running org.apache.hadoop.hdfs.TestLeaseRecovery2
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 73.445 sec - in org.apache.hadoop.hdfs.TestLeaseRecovery2
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed
in 8.0
Running org.apache.hadoop.hdfs.TestDFSStripedOutputStreamWithFailure160
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 98.315 sec - in org.apache.hadoop.hdfs.TestDFSStripedOutputStreamWithFailure160
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed
in 8.0
Running org.apache.hadoop.hdfs.TestCrcCorruption
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 30.387 sec - in org.apache.hadoop.hdfs.TestCrcCorruption
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed
in 8.0
Running org.apache.hadoop.hdfs.security.TestDelegationTokenForProxyUser
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 8.775 sec - in org.apache.hadoop.hdfs.security.TestDelegationTokenForProxyUser
{panel}

> ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-4937
>                 URL: https://issues.apache.org/jira/browse/HDFS-4937
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.0.4-alpha, 0.23.8
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>              Labels: BB2015-05-TBR
>         Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v1.patch, HDFS-4937.v2.patch,
HDFS-4937.v3.patch
>
>
> When a large number of nodes are removed by refreshing node lists, the network topology
is updated. If the refresh happens at the right moment, the replication monitor thread may
get stuck in the while loop of {{chooseRandom()}}. This is because the cached cluster size is
used in the termination condition check of the loop. This usually happens when a block with a
high replication factor is being processed. Since replicas/rack is also calculated beforehand,
no node choice may satisfy the goodness criteria if the refresh removed racks.
> All nodes will end up in the excluded list, but the list's size will still be less than the
cached cluster size, so it will loop infinitely. This was observed in a production environment.
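
The failure mode in the description can be illustrated with a small, self-contained sketch. This is not the Hadoop code and not necessarily what the attached patches do: the class {{ChooseRandomSketch}}, its methods, and the {{CAP}} safety limit are all hypothetical, and the sketch assumes that every candidate fails the goodness check. It contrasts a loop bounded by a cluster size cached before the refresh with a loop that consults the live node count.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Hadoop.
final class ChooseRandomSketch {
    // Safety cap so the demonstration terminates. The loop described in the
    // issue has no such cap, which is why it can spin forever.
    static final int CAP = 10_000;

    // Builds a list of n made-up datanode names.
    static List<String> nodes(int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            out.add("dn" + i);
        }
        return out;
    }

    // Loop bounded by a cluster size captured BEFORE nodes were removed.
    // liveNodes must be non-empty.
    static int attemptsWithCachedSize(List<String> liveNodes, int cachedSize) {
        Set<String> excluded = new HashSet<>();
        Random random = new Random(42);
        int attempts = 0;
        while (excluded.size() < cachedSize && attempts < CAP) {
            String candidate = liveNodes.get(random.nextInt(liveNodes.size()));
            // Assumption: every candidate fails the goodness check and is excluded.
            excluded.add(candidate);
            attempts++;
        }
        return attempts;
    }

    // Same loop, but the bound is the current number of live nodes.
    static int attemptsWithLiveSize(List<String> liveNodes) {
        Set<String> excluded = new HashSet<>();
        Random random = new Random(42);
        int attempts = 0;
        while (excluded.size() < liveNodes.size() && attempts < CAP) {
            String candidate = liveNodes.get(random.nextInt(liveNodes.size()));
            excluded.add(candidate);
            attempts++;
        }
        return attempts;
    }

    public static void main(String[] args) {
        int cachedSize = 10;            // size seen before the refresh
        List<String> live = nodes(4);   // only 4 nodes remain afterwards
        System.out.println("cached bound attempts: "
                + attemptsWithCachedSize(live, cachedSize));
        System.out.println("live bound attempts:   "
                + attemptsWithLiveSize(live));
    }
}
```

With the stale bound, the excluded set can never grow past the 4 remaining nodes, so only the artificial cap stops the loop. With the live bound, the loop exits as soon as every remaining node has been excluded.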



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
