hadoop-yarn-issues mailing list archives

From "Zhijie Shen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1336) Work-preserving nodemanager restart
Date Tue, 29 Oct 2013 03:52:33 GMT

    [ https://issues.apache.org/jira/browse/YARN-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807652#comment-13807652 ]

Zhijie Shen commented on YARN-1336:

It sounds like a nice feature. I thought about it a bit before: since we allow the RM to restart,
why not the NM? [~jlowe], do you have a writeup of the workflow for work-preserving NM restart?
If so, would you mind sharing it? I'm curious about the design. Judging from the current
sub-tasks, we need an NMStateStore (like RMStateStore for the RM) to store the aforementioned
information when the NM stops and to recover all of that state when the NM starts again. Beyond this,
how does the NM contact the RM and the AMs about its preserved work?
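
To make the NMStateStore idea concrete, here is a minimal sketch of what such a store might look like. Everything in it is hypothetical: the interface, its method names, the in-memory implementation, and the SketchNodeManager class are illustrations only, not the actual YARN API or the design attached to this ticket. It also assumes state is written as containers change rather than only at shutdown, and a real store would need to persist to local disk so the data outlives the NM process.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface; the names are illustrative, not the YARN API.
interface NMStateStore {
    void storeContainer(String containerId, String state);
    void removeContainer(String containerId);
    Map<String, String> loadContainers();
}

// In-memory stand-in used only for this sketch. A real implementation
// would have to write to durable local storage.
class InMemoryNMStateStore implements NMStateStore {
    private final Map<String, String> containers = new HashMap<>();

    @Override
    public void storeContainer(String containerId, String state) {
        containers.put(containerId, state);
    }

    @Override
    public void removeContainer(String containerId) {
        containers.remove(containerId);
    }

    @Override
    public Map<String, String> loadContainers() {
        // Return a copy so callers cannot mutate the stored state.
        return new HashMap<>(containers);
    }
}

// Hypothetical NM that records container state as it changes and
// reloads it from the store when it starts again.
class SketchNodeManager {
    private final NMStateStore store;
    private final Map<String, String> running = new HashMap<>();

    SketchNodeManager(NMStateStore store) {
        this.store = store;
    }

    void launch(String containerId) {
        running.put(containerId, "RUNNING");
        store.storeContainer(containerId, "RUNNING");
    }

    void complete(String containerId) {
        running.remove(containerId);
        store.removeContainer(containerId);
    }

    // Called on startup: pick up whatever the previous instance recorded.
    void recover() {
        running.putAll(store.loadContainers());
    }

    Map<String, String> runningContainers() {
        return new HashMap<>(running);
    }
}
```

In this sketch, a second SketchNodeManager constructed over the same store and asked to recover() would see only the containers that the first instance launched and had not yet completed. How the recovered NM then re-registers with the RM and reconnects to the AMs is exactly the open question above and is not modeled here.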

I have another question about this feature: how do we distinguish an NM restart from an NM shutdown?
If an NM shuts down and never comes back, should the work still be preserved (or trapped) there?
Currently, the NM immediately sends notification that the containers on it have been killed, and the
application has the chance to start another container to do its work.

> Work-preserving nodemanager restart
> -----------------------------------
>                 Key: YARN-1336
>                 URL: https://issues.apache.org/jira/browse/YARN-1336
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
> This serves as an umbrella ticket for tasks related to work-preserving nodemanager restart.

This message was sent by Atlassian JIRA
