hadoop-yarn-issues mailing list archives

From "Alejandro Abdelnur (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1343) NodeManagers additions/restarts are not reported as node updates in AllocateResponse responses to AMs
Date Mon, 28 Oct 2013 20:33:31 GMT

    [ https://issues.apache.org/jira/browse/YARN-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807191#comment-13807191 ]

Alejandro Abdelnur commented on YARN-1343:
------------------------------------------

Bikas, Vinod, Sandy and I had an offline chat on this. My action item was to debug the
RM's current behavior:

If an NM is shut down with 'sbin/yarn-daemon.sh stop nodemanager', the RM does not receive any
notification. The RM only detects that the NM is gone after {{nm.liveness-monitor.expiry-interval-ms}}
elapses (default 10 minutes), which triggers a {{DeactivateNodeTransition}}.

If the expiry interval has elapsed, the NM is removed from the RM context and the NM rejoining
is treated as an NM add.

If the expiry interval has not elapsed (the NM was restarted before it expired),
the NM rejoining is treated as an NM reconnect *only* if the NM address ({{yarn.nodemanager.address}})
uses a fixed port. By default it uses port {{0}}, an ephemeral port, which causes the NM
rejoin to be treated as an NM add.
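
To make the add-versus-reconnect distinction concrete, here is a minimal sketch of the behavior described above. It is not RM code: {{RejoinSketch}}, its methods, and port 8041 are hypothetical names and values chosen for illustration, and the only assumptions are that a node is identified by "host:port" and that expired nodes are forgotten.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model, not RM code: tracks active nodes by "host:port".
public class RejoinSketch {
    public enum Outcome { ADD, RECONNECT }

    // Node id ("host:port") -> timestamp of the last heartbeat, in ms.
    private final Map<String, Long> activeNodes = new HashMap<>();
    private final long expiryIntervalMs;

    public RejoinSketch(long expiryIntervalMs) {
        this.expiryIntervalMs = expiryIntervalMs;
    }

    // Removes every node whose last heartbeat is older than the expiry interval.
    public void expire(long nowMs) {
        activeNodes.values().removeIf(last -> nowMs - last > expiryIntervalMs);
    }

    // Registers an NM and reports how this model classifies the registration.
    public Outcome register(String host, int port, long nowMs) {
        expire(nowMs);
        String nodeId = host + ":" + port;
        Outcome outcome = activeNodes.containsKey(nodeId) ? Outcome.RECONNECT : Outcome.ADD;
        activeNodes.put(nodeId, nowMs);
        return outcome;
    }
}
```

In this model, an NM that comes back on the same fixed port within the interval is a RECONNECT, while one that comes back on a different (ephemeral) port, or after the interval has elapsed, is an ADD.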

Although using a fixed port makes the NM rejoin be treated as a reconnect, the NodeListManager
does not receive the NODE_USABLE event because the {{ReconnectNodeTransition}} transition
does not dispatch a {{NodeListManagerEvent}} *(bug-1)*.
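
One possible shape for addressing *bug-1*, as a sketch only: the reconnect handling also dispatches a NODE_USABLE event to whoever is listening. Every class and method below is a hypothetical stand-in (the real {{NodeListManagerEvent}} and {{ReconnectNodeTransition}} have different signatures), so this shows the idea rather than the patch.

```java
import java.util.function.Consumer;

// Hypothetical stand-in for a reconnect transition that notifies a listener.
public class ReconnectDispatchSketch {
    public enum EventType { NODE_USABLE, NODE_UNUSABLE }

    // Simplified event; it carries only a node id string for illustration.
    public static final class NodeEvent {
        public final EventType type;
        public final String nodeId;

        public NodeEvent(EventType type, String nodeId) {
            this.type = type;
            this.nodeId = nodeId;
        }
    }

    private final Consumer<NodeEvent> dispatcher;

    public ReconnectDispatchSketch(Consumer<NodeEvent> dispatcher) {
        this.dispatcher = dispatcher;
    }

    // Handles a reconnect and dispatches the event that bug-1 reports as missing.
    public void onReconnect(String nodeId) {
        // Reconnect bookkeeping would happen here in a real transition.
        dispatcher.accept(new NodeEvent(EventType.NODE_USABLE, nodeId));
    }
}
```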

Also, it seems the {{NodeListManager}} {{unusableRMNodesConcurrentSet}} is never cleaned up.
With NM ephemeral ports this is a memory leak. With NM fixed ports the leak is bounded
by the maximum number of NMs *(bug-2)*.
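
The difference in growth can be illustrated with a small simulation. This is a sketch under simplifying assumptions: entries are keyed by "host:port", nothing is ever removed, every restart on an ephemeral port gets a port not seen before, and {{UnusableSetSketch}}, port 8041 and the starting port 32768 are made up for the example.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical simulation of a set of unusable nodes that is never cleaned up.
public class UnusableSetSketch {
    // Simulates `restarts` restarts on each of `hosts` nodes and returns
    // how many entries the never-cleaned set ends up holding.
    public static int unusableSetSize(int hosts, int restarts, boolean fixedPort) {
        Set<String> unusable = new HashSet<>();
        int nextEphemeralPort = 32768; // assumed: each restart gets a fresh port
        for (int r = 0; r < restarts; r++) {
            for (int h = 0; h < hosts; h++) {
                int port = fixedPort ? 8041 : nextEphemeralPort++;
                // With a fixed port the same id is re-added, so the set does not grow.
                unusable.add("host" + h + ":" + port);
            }
        }
        return unusable.size();
    }
}
```

With fixed ports the size stays at the number of hosts regardless of the number of restarts; with ephemeral ports it grows as hosts times restarts.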

Contrary to what the javadocs state, having AMs receive node updates when an NM is added or
rejoins, with the NM reported as RUNNING, does not seem to have been the original intention,
but it seems a reasonable feature *(improvement-1)*.
 
Given that *bug-1* and *improvement-1* are tightly related, I propose taking care of both as
part of this JIRA, and I'll open a new JIRA for *bug-2*.


> NodeManagers additions/restarts are not reported as node updates in AllocateResponse responses to AMs
> -----------------------------------------------------------------------------------------------------
>
>                 Key: YARN-1343
>                 URL: https://issues.apache.org/jira/browse/YARN-1343
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.2.0
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Critical
>             Fix For: 2.2.1
>
>         Attachments: YARN-1343.patch
>
>
> If a NodeManager joins the cluster or gets restarted, running AMs never receive the node update indicating the Node is running.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
