hadoop-mapreduce-issues mailing list archives

From "Eric Badger (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6507) MiniYARNCluster.start() returns before cluster is completely started
Date Thu, 04 Feb 2016 22:33:40 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133187#comment-15133187 ]

Eric Badger commented on MAPREDUCE-6507:

Tests are failing because of a race condition between the RM startup and the NM startup. Each
of their serviceStart() methods spawns a new thread to call start(), which introduces the
race. The NM is set up with a waitCount of up to 60 seconds, so that it can wait for the
cluster to complete startup (even though the start method for the RM has already returned).
Removing the threads fixes the race in the test that prompted this Jira (TestRMNMInfo), but
causes other tests to fail. Any test that starts a MiniYARNCluster without an active RM will
fail, because the node managers block the main thread from transitioning one of the RMs from
standby to active. This is why the threads worked: they allowed the NMs to wait while the
main thread zoomed by and transitioned a standby RM to active.
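
To make the race concrete, here is a minimal sketch in plain Java. It does not use the Hadoop
classes; ToyService, startAsync() and startAndWait() are hypothetical names invented for this
illustration. startAsync() hands the startup work to a background thread and returns at once,
which is the pattern described above, while startAndWait() blocks on a latch until that work
has finished or a timeout expires.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class StartRaceSketch {

    /** Hypothetical stand-in for a service whose startup runs on a background thread. */
    static class ToyService {
        private final CountDownLatch started = new CountDownLatch(1);
        private final long startupMillis;

        ToyService(long startupMillis) {
            this.startupMillis = startupMillis;
        }

        /** Returns immediately; startup continues on another thread. */
        void startAsync() {
            Thread t = new Thread(() -> {
                try {
                    Thread.sleep(startupMillis); // simulated startup work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                started.countDown(); // signal that startup is complete
            });
            t.setDaemon(true);
            t.start();
        }

        /** Returns true once startup has finished, or false if the timeout expires first. */
        boolean startAndWait(long timeoutMillis) throws InterruptedException {
            startAsync();
            return started.await(timeoutMillis, TimeUnit.MILLISECONDS);
        }

        boolean isStarted() {
            return started.getCount() == 0;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // The caller may observe an unstarted service here: this is the race.
        ToyService racy = new ToyService(200);
        racy.startAsync();
        System.out.println("async start, ready on return: " + racy.isStarted());

        // The caller only continues once startup is done (or the timeout expires).
        ToyService safe = new ToyService(200);
        System.out.println("blocking start, ready on return: " + safe.startAndWait(5000));
    }
}
```

A test that asserts on cluster state right after the asynchronous variant returns is relying
on timing, which would be consistent with the intermittent failure seen in TestRMNMInfo.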

I propose changing the MiniYARNCluster start method so that it does not return until the
cluster is completely started, and so that it always makes one RM active in HA setups. This
will require changes to the affected tests (TestRMFailover, TestMiniYARNClusterForHA, etc.),
but it makes the code more understandable and removes the races. The tests only pass right
now because excessive timeouts mask the race that they're fighting.
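
As a rough sketch of what "does not return until the cluster is completely started" could
look like, the toy below blocks in start() until every node has registered, with a bounded
timeout. This is not the attached patch and not the MiniYARNCluster API: ToyCluster,
waitUntil() and getLiveNodes() are hypothetical names, the node registration is simulated
with sleeping threads, and the RM active/standby transition is left out.

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

public class BlockingStartSketch {

    /** Polls the condition until it holds, or throws once the timeout has elapsed. */
    static void waitUntil(BooleanSupplier condition, long timeoutMillis, long intervalMillis)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() >= deadline) {
                throw new TimeoutException("condition not met within " + timeoutMillis + " ms");
            }
            Thread.sleep(intervalMillis);
        }
    }

    /** Hypothetical toy cluster whose nodes register on background threads. */
    static class ToyCluster {
        private final int expectedNodes;
        private final AtomicInteger liveNodes = new AtomicInteger();

        ToyCluster(int expectedNodes) {
            this.expectedNodes = expectedNodes;
        }

        /** Does not return until every node has registered; throws on timeout. */
        void start(long timeoutMillis) throws InterruptedException, TimeoutException {
            for (int i = 0; i < expectedNodes; i++) {
                final long delay = 20L * (i + 1); // simulated, staggered registration
                Thread t = new Thread(() -> {
                    try {
                        Thread.sleep(delay);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    liveNodes.incrementAndGet();
                });
                t.setDaemon(true);
                t.start();
            }
            waitUntil(() -> liveNodes.get() == expectedNodes, timeoutMillis, 10);
        }

        int getLiveNodes() {
            return liveNodes.get();
        }
    }

    public static void main(String[] args) throws Exception {
        ToyCluster cluster = new ToyCluster(4);
        cluster.start(5000);
        // start() has returned without a timeout, so all nodes are registered.
        System.out.println("live nodes after start(): " + cluster.getLiveNodes());
    }
}
```

With a start method shaped like this, a test can assert on the live node count immediately
after start() returns, and a cluster that never comes up fails fast with a clear timeout
instead of a confusing assertion error later on.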

[~kasha] [~jlowe] Please advise. 

> MiniYARNCluster.start() returns before cluster is completely started
> --------------------------------------------------------------------
>                 Key: MAPREDUCE-6507
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6507
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: test
>            Reporter: Rohith Sharma K S
>            Assignee: Eric Badger
>         Attachments: MAPREDUCE-6507.001.patch
> TestRMNMInfo fails intermittently. Below is the trace for the failure:
> {noformat}
> testRMNMInfo(org.apache.hadoop.mapreduce.v2.TestRMNMInfo)  Time elapsed: 0.28 sec  <<<
> java.lang.AssertionError: Unexpected number of live nodes: expected:<4> but was:<3>
> 	at org.junit.Assert.fail(Assert.java:88)
> 	at org.junit.Assert.failNotEquals(Assert.java:743)
> 	at org.junit.Assert.assertEquals(Assert.java:118)
> 	at org.junit.Assert.assertEquals(Assert.java:555)
> 	at org.apache.hadoop.mapreduce.v2.TestRMNMInfo.testRMNMInfo(TestRMNMInfo.java:111)
> {noformat}
