hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1336) Work-preserving nodemanager restart
Date Tue, 22 Oct 2013 19:14:46 GMT

    [ https://issues.apache.org/jira/browse/YARN-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13802148#comment-13802148 ]

Jason Lowe commented on YARN-1336:
----------------------------------

Upgrading all of the nodes in a cluster for a rolling upgrade can be a very disruptive or
lengthy process.  If the nodemanager is taken down, then all active containers on that node
are killed.  This is disruptive to jobs with long-running tasks, especially if one of the
tasks ends up hitting this situation across multiple attempts.  An alternative would be a
drain-decommission for nodes as proposed in YARN-914.  However, with long-running applications/tasks
it can take a very long time to decommission a node, as we have to wait not only for the active
containers to complete but also for active applications in general (e.g., the node still has to serve
up map task data after a map task completes, so auxiliary services can have responsibilities
beyond the active containers).  Performing a rolling upgrade on a large cluster will take
a very long time if we need to wait for a clean drain-decommission of each node.

Therefore it would be nice if the nodemanager supported a mode where it could be restarted
and recover its state.  This would include recovering active container state, tokens, localized
resource cache state, etc.  We could then bounce the nodemanager to an updated version without
losing containers and with minimal impact to jobs running on the grid, and the time to perform
a rolling upgrade of a large cluster would no longer be tied to the running time of applications
currently active on the cluster.

> Work-preserving nodemanager restart
> -----------------------------------
>
>                 Key: YARN-1336
>                 URL: https://issues.apache.org/jira/browse/YARN-1336
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>
> This serves as an umbrella ticket for tasks related to work-preserving nodemanager restart.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
