hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1336) Work-preserving nodemanager restart
Date Fri, 10 Jan 2014 23:11:55 GMT

    [ https://issues.apache.org/jira/browse/YARN-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868446#comment-13868446 ]

Jason Lowe commented on YARN-1336:

bq. One of the scenarios for NM restart is NM config update.

If we want to update the NM configs without a restart, I think that's a separate effort. 
In theory that's not strictly necessary if the NM preserves work when it restarts, but there
will be cases where an NM restart can cause 'hiccups' (e.g., we're mid-shuffle and have to
retry or need to retry a container launch request).

bq. There is one scenario where we want to decomm the node and would like to preserve the
state of long running tasks. For that somehow RM and AM will need to know about it so that
it can checkpoint and resume the tasks on other nodes.

Task checkpointing or task migration is not in the scope of this work.

bq. Will making ShuffleHandler be an out-of-proc help?

I'm not planning on moving the aux services out of the NM as part of this effort.  As you
point out, it's something that would be good to do to separate concerns even without the NM
restart feature. I agree the experience of aux service clients will be much smoother if they're
moved out and the scenario is we're only restarting the NM, but I don't see moving them out
as a prerequisite to supporting basic NM restart functionality. As such, I think moving the
aux services outside the NM should be a separate JIRA that could theoretically be completed
before or after this feature.

> Work-preserving nodemanager restart
> -----------------------------------
>                 Key: YARN-1336
>                 URL: https://issues.apache.org/jira/browse/YARN-1336
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager
>    Affects Versions: 2.4.0
>            Reporter: Jason Lowe
> This serves as an umbrella ticket for tasks related to work-preserving nodemanager restart.

This message was sent by Atlassian JIRA
