hadoop-yarn-issues mailing list archives

From "Robert Kanter (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-5566) client-side NM graceful decom doesn't trigger when jobs finish
Date Fri, 26 Aug 2016 08:08:21 GMT

     [ https://issues.apache.org/jira/browse/YARN-5566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Kanter updated YARN-5566:
    Attachment: YARN-5566.001.patch

While digging into this, I suspected that the DECOMMISSIONING node still thought it was
running apps, so I added a bunch of extra print statements and confirmed that this was the case.
While all RUNNING nodes had the correct counts, the DECOMMISSIONING node's count somehow went
back up.  I was looking at the state transitions, and saw that this one
          EnumSet.of(NodeState.DECOMMISSIONING, NodeState.DECOMMISSIONED),
          new StatusUpdateWhenHealthyTransition())
looked different from this analogous one
          EnumSet.of(NodeState.RUNNING, NodeState.UNHEALTHY),
          new StatusUpdateWhenHealthyTransition())
despite calling the same transition.  So I tried adding the UNHEALTHY state, and that fixed
the problem.
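
To make the difference between the two registrations concrete, here is a toy model of a
transition table in which the same hook is registered with two different post-state sets.
It is only a sketch: TransitionSketch, fire, and the hook are hypothetical names, this is
not Hadoop's state machine code, and rejecting an undeclared result is an assumption of
the model rather than a statement about what RMNodeImpl does.

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.function.UnaryOperator;

public class TransitionSketch {
    // Simplified stand-in for the node states named in the message.
    enum NodeState { RUNNING, UNHEALTHY, DECOMMISSIONING, DECOMMISSIONED }

    // Hypothetical helper: run a multi-result hook and check that the state it
    // returns is one of the post-states declared for this arc.
    static NodeState fire(NodeState current,
                          Set<NodeState> declaredPostStates,
                          UnaryOperator<NodeState> hook) {
        NodeState next = hook.apply(current);
        if (!declaredPostStates.contains(next)) {
            throw new IllegalStateException(
                "undeclared post-state " + next + " reached from " + current);
        }
        return next;
    }

    public static void main(String[] args) {
        // Toy hook (assumption): an unhealthy report always yields UNHEALTHY.
        UnaryOperator<NodeState> unhealthyReport = s -> NodeState.UNHEALTHY;

        Set<NodeState> fromRunning =
            EnumSet.of(NodeState.RUNNING, NodeState.UNHEALTHY);
        Set<NodeState> fromDecommissioning =
            EnumSet.of(NodeState.DECOMMISSIONING, NodeState.DECOMMISSIONED);

        // The RUNNING arc declares UNHEALTHY, so the hook's result is accepted.
        System.out.println(fire(NodeState.RUNNING, fromRunning, unhealthyReport));

        // The DECOMMISSIONING arc does not declare UNHEALTHY, so the same hook
        // produces a result this toy table rejects.
        try {
            fire(NodeState.DECOMMISSIONING, fromDecommissioning, unhealthyReport);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In this model, adding NodeState.UNHEALTHY to the second set makes both arcs accept the
same results from the shared hook.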

[~djp] any ideas what's going on here?  

> client-side NM graceful decom doesn't trigger when jobs finish
> --------------------------------------------------------------
>                 Key: YARN-5566
>                 URL: https://issues.apache.org/jira/browse/YARN-5566
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.8.0
>            Reporter: Robert Kanter
>            Assignee: Robert Kanter
>         Attachments: YARN-5566.001.patch
> I was testing the client-side NM graceful decommission and noticed that it was always
> waiting for the timeout, even if all jobs running on that node (or even the cluster) had already
> finished.
> For example:
> # JobA is running with at least one container on NodeA
> # User runs client-side decom on NodeA at 5:00am with a timeout of 3 hours --> NodeA
> # JobA finishes at 6:00am and there are no other jobs running on NodeA
> # User's client reaches the timeout at 8:00am, and forcibly decommissions NodeA
> NodeA should have decommissioned at 6:00am.
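
The timeline in the quoted description can be worked through as a small sketch. It assumes
the expected rule is "decommission at whichever comes first, the last job finishing or the
timeout"; DecomTimeline and expectedDecomTime are hypothetical names, not YARN code.

```java
import java.time.Duration;
import java.time.LocalTime;

public class DecomTimeline {
    // Assumed rule: the node should leave service at the earlier of the time
    // its last job finishes and the time the client-side timeout expires.
    // Uses same-day clock times only, to match the example in the description.
    static LocalTime expectedDecomTime(LocalTime decomStart,
                                       Duration timeout,
                                       LocalTime lastJobFinish) {
        LocalTime deadline = decomStart.plus(timeout);
        return lastJobFinish.isBefore(deadline) ? lastJobFinish : deadline;
    }

    public static void main(String[] args) {
        LocalTime decomStart = LocalTime.of(5, 0);   // decom requested at 5:00am
        Duration timeout = Duration.ofHours(3);      // 3 hour timeout
        LocalTime jobAFinish = LocalTime.of(6, 0);   // JobA finishes at 6:00am

        // What the description says should happen.
        System.out.println("expected: "
            + expectedDecomTime(decomStart, timeout, jobAFinish));
        // What was observed: the node waited for the full timeout.
        System.out.println("observed: " + decomStart.plus(timeout));
    }
}
```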
