hadoop-yarn-issues mailing list archives

From "Xuan Gong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-914) Support graceful decommission of nodemanager
Date Thu, 05 Feb 2015 04:15:36 GMT

    [ https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306602#comment-14306602 ]

Xuan Gong commented on YARN-914:

Thanks for the proposal [~djp]

bq. RM fails over (with HA enabled) just when a graceful decommission has been triggered. We
should make sure the new active RM can carry the action forward (how do we keep the decommissioned
node list in sync between the active and standby RM?)

I believe this concerns configuration synchronization between multiple RM nodes. Please
take a look at https://issues.apache.org/jira/browse/YARN-1666 and https://issues.apache.org/jira/browse/YARN-1611

bq. With containers of long running services, the timeout may not help but only delay the
upgrade/reboot process. Shall we skip it and decommission directly in this case?

Do we really need to handle "LRS containers" and "short-term containers" differently?
There are lots of different cases we need to take care of. I think we can handle both
the same way.

bq. Another possibility is to track the decommission timeout on the RM side, instead of the NM side:
the new decommission service proposed above. Which way is better?

Maybe we need to track the timeout on both the RM side and the NM side. The RM can stop the NM if the timeout is
reached but it has not received the "decommission complete" signal from the NM.
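To make the suggestion concrete, here is a minimal sketch of RM-side deadline bookkeeping. The class and method names (DecommissionTimeoutTracker, startDecommission, onDecommissionComplete, expiredNodes) are hypothetical, not actual YARN APIs; the point is only that the RM can record a deadline per decommissioning node, clear it on a "decommission complete" report, and force-stop any node still tracked past its deadline.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not a real YARN class: per-node decommission
// deadlines tracked on the RM side.
public class DecommissionTimeoutTracker {
    // nodeId -> absolute deadline in millis
    private final Map<String, Long> deadlines = new HashMap<>();

    /** A graceful decommission was triggered for this node. */
    public void startDecommission(String nodeId, long nowMillis, long timeoutMillis) {
        deadlines.put(nodeId, nowMillis + timeoutMillis);
    }

    /** The NM reported "decommission complete"; stop tracking it. */
    public void onDecommissionComplete(String nodeId) {
        deadlines.remove(nodeId);
    }

    /** Nodes past their deadline with no completion report; the RM should force-stop these. */
    public List<String> expiredNodes(long nowMillis) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> e : deadlines.entrySet()) {
            if (nowMillis >= e.getValue()) {
                expired.add(e.getKey());
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        DecommissionTimeoutTracker tracker = new DecommissionTimeoutTracker();
        // Two NMs start draining with a 10-minute timeout.
        tracker.startDecommission("nm-1", 0L, TimeUnit.MINUTES.toMillis(10));
        tracker.startDecommission("nm-2", 0L, TimeUnit.MINUTES.toMillis(10));
        // nm-1 drained its containers in time and reported completion.
        tracker.onDecommissionComplete("nm-1");
        // At the 11-minute mark, only nm-2 is overdue.
        System.out.println(tracker.expiredNodes(TimeUnit.MINUTES.toMillis(11)));
    }
}
```

Because the deadlines live in RM state, this bookkeeping would also have to be carried over on RM failover, which ties back to the node-list synchronization question above.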

> Support graceful decommission of nodemanager
> --------------------------------------------
>                 Key: YARN-914
>                 URL: https://issues.apache.org/jira/browse/YARN-914
>             Project: Hadoop YARN
>          Issue Type: Improvement
>    Affects Versions: 2.0.4-alpha
>            Reporter: Luke Lu
>            Assignee: Junping Du
>         Attachments: Gracefully Decommission of NodeManager (v1).pdf
> When NMs are decommissioned for non-fault reasons (capacity change etc.), it's desirable
> to minimize the impact on running applications.
> Currently, if an NM is decommissioned, all running containers on the NM need to be rescheduled
> on other NMs. Furthermore, for finished map tasks, if their map outputs have not been fetched by
> the reducers of the job, these map tasks will need to be rerun as well.
> We propose to introduce a mechanism to optionally gracefully decommission a node manager.

This message was sent by Atlassian JIRA
