hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-914) Support graceful decommission of nodemanager
Date Tue, 10 Feb 2015 19:05:14 GMT

    [ https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14314677#comment-14314677 ]

Jason Lowe commented on YARN-914:

bq. However, YARN-2567 is about threshold thing, may be a wrong JIRA number?

That's the right JIRA.  It's about waiting for a threshold number of nodes to report back
in after the RM recovers, and the RM would need to persist the state about the nodes in the
cluster to know what percentage of the old nodes have reported back in.
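
To make the threshold idea concrete, here is a minimal sketch in Python of the check described above. It is not the YARN-2567 implementation; the function name, the arguments, and the choice to treat an empty persisted set as "nothing to wait for" are all assumptions made for illustration.

```python
def enough_nodes_reported(persisted_nodes, reported_nodes, threshold):
    """Return True once the fraction of previously known nodes that have
    reported back in reaches `threshold` (a value between 0.0 and 1.0).

    persisted_nodes: node ids saved before the restart (hypothetical input).
    reported_nodes:  node ids that have registered since recovery.
    """
    known = set(persisted_nodes)
    if not known:
        # Assumption: with no persisted nodes there is nothing to wait for.
        return True
    # Only nodes that were part of the old cluster count toward the threshold.
    back = known & set(reported_nodes)
    return len(back) / len(known) >= threshold
```

The point of the sketch is only that the percentage cannot be computed without the persisted set of old nodes, which is why the RM would need to store it.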

As for whether we should just provide hooks vs. making it much more of a turnkey solution,
I'd advocate for initially seeing what we can do with hooks.  Based on what we learn from
trying to do decommission that way, we can provide feedback into the process of making
it a built-in, turnkey solution later.  I do agree with Vinod that there should minimally
be an easy way, CLI or otherwise, for outside scripts driving the decommission to either force
it or wait for it to complete.  If waiting, there also needs to be either a timeout on the
wait, after which the decommission is forced, or another method to easily kill the
containers still on that node.
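
As a rough illustration of what an outside script driving the decommission might look like, here is a sketch in Python of the wait-with-timeout-then-force flow. The three callables (start_graceful, running_containers, force) are hypothetical hooks standing in for whatever CLI or API ends up being provided; none of them is an existing YARN interface.

```python
import time


def decommission(node, start_graceful, running_containers, force,
                 timeout_s=None, poll_s=1.0,
                 clock=time.monotonic, sleep=time.sleep):
    """Drive a graceful decommission of `node` from an external script.

    start_graceful(node)     -- hypothetical hook: stop scheduling onto the node
    running_containers(node) -- hypothetical hook: containers still on the node
    force(node)              -- hypothetical hook: kill what is left

    With timeout_s=None the script waits indefinitely; otherwise it forces
    the decommission once the timeout has elapsed.
    """
    start_graceful(node)
    deadline = None if timeout_s is None else clock() + timeout_s
    while running_containers(node) > 0:
        if deadline is not None and clock() >= deadline:
            force(node)
            return "forced"
        sleep(poll_s)
    return "drained"
```

Forcing immediately corresponds to calling this with timeout_s=0, and waiting for completion corresponds to timeout_s=None, so one entry point covers both cases mentioned above.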

> Support graceful decommission of nodemanager
> --------------------------------------------
>                 Key: YARN-914
>                 URL: https://issues.apache.org/jira/browse/YARN-914
>             Project: Hadoop YARN
>          Issue Type: Improvement
>    Affects Versions: 2.0.4-alpha
>            Reporter: Luke Lu
>            Assignee: Junping Du
>         Attachments: Gracefully Decommission of NodeManager (v1).pdf
> When NMs are decommissioned for non-fault reasons (capacity change etc.), it's desirable
> to minimize the impact to running applications.
> Currently if a NM is decommissioned, all running containers on the NM need to be rescheduled
> on other NMs. Furthermore, for finished map tasks, if their map outputs have not been fetched by
> the reducers of the job, these map tasks will need to be rerun as well.
> We propose to introduce a mechanism to optionally gracefully decommission a node manager.

This message was sent by Atlassian JIRA
