hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list
Date Wed, 04 Nov 2015 22:20:27 GMT

    [ https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990565#comment-14990565

Jason Lowe commented on YARN-4311:

Thanks for the patch, Kuhu!  Test failures are related, please look into them.

In general the patch seems like a reasonable approach.  There needs to be some way for admins
to remove nodes that are no longer relevant to the cluster, and AFAIK there's no supported
way to do this short of restarting the resourcemanager.  As nodes churn in and out of the
cluster, they will simply accumulate in the decommissioned or lost nodes buckets until the
next resourcemanager restart.

My main concern is the behavior when someone botches the include list (e.g.: accidentally
truncates the includes list file and refreshes).  At that point all of the cluster nodes will
disappear from the resourcemanager with no indication of what happened (except potentially
the shutdown metric will increment by the number of nodes lost).  Today they will all go into
the decommissioned bucket, but with this patch they'll simply disappear.  This either needs
to be "as designed" behavior, or we'd have to implement a separate mechanism outside of the
include/exclude lists to direct the RM to "forget" a node.  I believe HDFS recently was changed
to behave this was as well wrt. the include/exclude lists and forgetting nodes (see HDFS-8950),
so I'm inclined to be consistent with that and say it's "as designed."

Some comments on the patch itself:

isInvalidAndAbsent doesn't have the same handling of IP's as isValidNode does.

It might also be clearer if isInvalidAndAbsent were just named isUntracked or isUntrackedNode
indicating those are nodes we aren't tracking in any way.

isInvalidAndAbsent doesn't lock hostsReader like isValidNode does.

What about refreshNodesGracefully?  That also refreshes the host include/exclude lists and
arguably needs similar logic.  We need to discuss what it means to gracefully refresh the
list when the node completely disappears from both the include and exclude list.  Should it
still gracefully decommission, and how do we make sure that node is properly tracked?  If
graceful, does it automatically disappear when the decommission completes since it's not in
either list?

Nit: While looping over the nodes, if the node is valid then there's no reason to check if
its not valid and absent.  So it could be simplified to the following:
    for (NodeId nodeId: rmContext.getRMNodes().keySet()) {
      if (!isValidNode(nodeId.getHost())) {
        RMNodeEventType nodeEventType = isInvalidAndAbsent(nodeId.getHost()) ? RMNodeEventType.SHUTDOWN
            new RMNodeEvent(nodeId, nodeEventType));

> Removing nodes from include and exclude lists will not remove them from decommissioned
nodes list
> -------------------------------------------------------------------------------------------------
>                 Key: YARN-4311
>                 URL: https://issues.apache.org/jira/browse/YARN-4311
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.1
>            Reporter: Kuhu Shukla
>            Assignee: Kuhu Shukla
>         Attachments: YARN-4311-v1.patch
> In order to fully forget about a node, removing the node from include and exclude list
is not sufficient. The RM lists it under Decomm-ed nodes. The tricky part that [~jlowe] pointed
out was the case when include lists are not used, in that case we don't want the nodes to
fall off if they are not active.

This message was sent by Atlassian JIRA

View raw message