hadoop-yarn-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-914) Support graceful decommission of nodemanager
Date Tue, 23 Dec 2014 03:29:15 GMT

    [ https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256547#comment-14256547 ]

Junping Du commented on YARN-914:

Hi [~mingma], thanks for the comments here.
bq. So YARN will reduce the capacity of the nodes as part of the decommission process until
all its map outputs are fetched or until all the applications the node touches have completed?
Yes. I am not sure it is necessary for YARN to additionally mark the node as decommissioned,
since the node's resource is already updated to 0 and no container will get a chance to be
allocated on the node. The auxiliary service should still be running, which shouldn't consume
much resource if there are no service requests.
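
To illustrate the zero-capacity point, here is a minimal sketch in Python. It is not YARN code: `Node`, `allocate`, and the memory figures are hypothetical names and values chosen for the example. It only shows that once a node's capacity is set to 0, a simple allocator never places new work on it, while the containers already running there are left alone.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    capacity_mb: int   # total resource the scheduler believes the node has
    used_mb: int = 0   # resource held by containers that are already running

    @property
    def available_mb(self) -> int:
        # Clamp at 0: usage may exceed a capacity that was lowered while draining.
        return max(self.capacity_mb - self.used_mb, 0)


def allocate(nodes: List[Node], request_mb: int) -> Optional[Node]:
    """Place a request on the first node with enough free resource (hypothetical policy)."""
    for node in nodes:
        if request_mb > 0 and node.available_mb >= request_mb:
            node.used_mb += request_mb
            return node
    return None


nodes = [Node("nm1", capacity_mb=8192, used_mb=2048), Node("nm2", capacity_mb=8192)]
nodes[0].capacity_mb = 0        # start draining nm1; its running containers keep going
placed = allocate(nodes, 1024)  # can only land on nm2
```

In this sketch nm1 still reports 2048 MB in use after the capacity change, which matches the idea that draining stops new allocations without killing existing work.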

bq. In addition, it will be interesting to understand how you handle long running jobs.
Do you mean long-running services?
First, I think we should support a timeout when draining the node's resources (ResourceOption
already has a timeout in its design), so running containers should be preempted once the timeout expires.

Second, we should support a special container tag for long-running services (some discussion
in YARN-1039) so we don't have to wait until the timeout for such containers to finish.
Third, from an operations perspective, we could add a long-running label to specific nodes and
try not to decommission nodes with that label.
Let me know if this makes sense to you.
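
The three points above can be sketched as a small Python simulation. This is only an illustration under stated assumptions, not the proposed implementation: `Container`, `drain`, `pick_nodes_to_decommission`, the `long_running` flag, and the "long-running" label string are all hypothetical, and preempting tagged containers immediately is one possible reading of the second point.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Container:
    cid: str
    remaining_s: int            # seconds of work left (simulation input only)
    long_running: bool = False  # stands in for the proposed long-running tag


def drain(containers: List[Container], timeout_s: int) -> Dict[str, str]:
    """Report what happens to each container while the node drains."""
    outcome = {}
    for c in containers:
        if c.long_running:
            # Assumption: a tagged service will not exit by itself, so do not wait.
            outcome[c.cid] = "preempted immediately"
        elif c.remaining_s <= timeout_s:
            outcome[c.cid] = "finished"
        else:
            outcome[c.cid] = "preempted at timeout"
    return outcome


def pick_nodes_to_decommission(node_labels: Dict[str, Set[str]], count: int) -> List[str]:
    """Prefer nodes that do not carry the (hypothetical) 'long-running' label."""
    ordered = sorted(node_labels, key=lambda n: ("long-running" in node_labels[n], n))
    return ordered[:count]


result = drain(
    [Container("c1", 30), Container("c2", 600), Container("c3", 10**9, long_running=True)],
    timeout_s=120,
)
candidates = pick_nodes_to_decommission(
    {"nm1": {"long-running"}, "nm2": set(), "nm3": set()}, count=2
)
```

Here c1 finishes within the timeout, c2 is preempted when the timeout expires, c3 is preempted without waiting, and nm2 and nm3 are chosen ahead of the labeled nm1.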

> Support graceful decommission of nodemanager
> --------------------------------------------
>                 Key: YARN-914
>                 URL: https://issues.apache.org/jira/browse/YARN-914
>             Project: Hadoop YARN
>          Issue Type: Improvement
>    Affects Versions: 2.0.4-alpha
>            Reporter: Luke Lu
>            Assignee: Junping Du
> When NMs are decommissioned for non-fault reasons (capacity change etc.), it's desirable
> to minimize the impact on running applications.
> Currently, if an NM is decommissioned, all running containers on the NM need to be rescheduled
> on other NMs. Furthermore, for finished map tasks, if their map outputs have not been fetched by
> the reducers of the job, these map tasks will need to be rerun as well.
> We propose to introduce a mechanism to optionally gracefully decommission a node manager.

This message was sent by Atlassian JIRA
