hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3102) Decommisioned Nodes not listed in Web UI
Date Tue, 26 Jan 2016 21:25:40 GMT

    [ https://issues.apache.org/jira/browse/YARN-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118067#comment-15118067

Jason Lowe commented on YARN-3102:

Test failures are unrelated and tracked elsewhere.

Patch looks pretty good, but there are some nits:

Not caused by this JIRA, but the node rejoin transition has the contains-followed-by-get problem.
 If another thread removes the node after the containsKey call then the get will return null
but the code doesn't handle that and will NPE.  It is safer and faster to just do the get
then check for null rather than perform the key lookup twice.  This also avoids the awkward
declaring of the previousRMNode variable with no initial value.

Similarly we should not call get-then-remove when checking for the unknown node.  Instead
we can simply call remove on the unknown node ID and check if the remove returned anything
to know if it was there.

In TestResourceTrackerService, wouldn't it be simpler to create an overloaded form of writeToHostsFile
that takes the specified file rather than replacing the existing method?  Then we can implement
the original method in terms of the new method and cut out a large portion of this patch where
it has to fixup all the original calls of the removed method.

4 seconds seems pretty aggressive for a test timeout, especially with multiple RM bringups
and teardowns involved.  If this runs on a sluggish jenkins machine that happens to pause
at the wrong time then the test fails -- i.e.: is it important that this test fails if it
executes in 5 seconds instead?  Seems like the timeout should be at least 20 seconds, if there
is an explicit timeout specified at all (surefire has one built in).

Should we just change MockRM to use the drain dispatcher and expose a drain events method
rather than fix a bunch of places to override which dispatcher to use?

> Decommisioned Nodes not listed in Web UI
> ----------------------------------------
>                 Key: YARN-3102
>                 URL: https://issues.apache.org/jira/browse/YARN-3102
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>         Environment: 2 Node Manager and 1 Resource Manager 
>            Reporter: Bibin A Chundatt
>            Assignee: Kuhu Shukla
>            Priority: Minor
>         Attachments: YARN-3102-v1.patch, YARN-3102-v2.patch, YARN-3102-v3.patch, YARN-3102-v4.patch,
YARN-3102-v5.patch, YARN-3102-v6.patch
> Configure yarn.resourcemanager.nodes.exclude-path in yarn-site.xml to yarn.exlude file
In RM1 machine
> Add Yarn.exclude with NM1 Host Name 
> Start the node as listed below NM1,NM2 Resource manager
> Now check Nodes decommisioned in /cluster/nodes
> Number of decommisioned node is listed as 1 but Table is empty in /cluster/nodes/decommissioned
(detail of Decommision node not shown)

This message was sent by Atlassian JIRA

View raw message