hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
Date Fri, 20 Jun 2014 19:33:25 GMT

    [ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039235#comment-14039235 ]

Jason Lowe commented on YARN-1341:

bq. Application state - If we fail to store an application update, i.e. from init to finish,
then we get the wrong application state after recovery.

Yes, applications should be like containers.  If we fail to store an application start in
the state store then we should fail the container launch that triggered the application to
be added.  This already happens in the current patch for YARN-1354.  If we fail to store the
completion of an application then worst-case we will report an application to the RM on restart
that isn't active, and the RM will correct the NM when it re-registers.
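
A minimal sketch of that asymmetry, for illustration only. The `AppStateStore` interface and `AppTracker` class are hypothetical stand-ins, not the NMStateStoreService API or code from the YARN-1354 patch. The sketch assumes a store failure surfaces as an IOException: a failed start record propagates and fails the launch, while a failed finish record is only logged.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the NM state store (not the real YARN interface).
interface AppStateStore {
    void storeApplicationStart(String appId) throws IOException;
    void storeApplicationFinish(String appId) throws IOException;
}

// Hypothetical tracker for the applications the NM considers active.
class AppTracker {
    private final AppStateStore store;
    private final Set<String> activeApps = new HashSet<>();

    AppTracker(AppStateStore store) {
        this.store = store;
    }

    // Called for the container launch that adds the application.
    // If the start cannot be persisted, the exception propagates,
    // the launch fails, and the application is never tracked.
    void launchFirstContainer(String appId) throws IOException {
        store.storeApplicationStart(appId);
        activeApps.add(appId);
    }

    // A failed finish record is tolerated: after a restart the NM may
    // report this application as active, and the RM is expected to correct it.
    // Returns true only if the finish was persisted.
    boolean finishApplication(String appId) {
        activeApps.remove(appId);
        try {
            store.storeApplicationFinish(appId);
            return true;
        } catch (IOException e) {
            System.err.println("Could not record finish of " + appId + ": " + e.getMessage());
            return false;
        }
    }

    boolean isActive(String appId) {
        return activeApps.contains(appId);
    }
}
```

The ordering in `launchFirstContainer` is the point of the sketch: the store write comes before any in-memory change, so a failed write leaves nothing to recover.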

bq. NodeManagerMetrics - The NM metrics will get messed up if only partially updated.

I wasn't planning on persisting metrics across a restart, as there are quite a few of them
(e.g., RPC metrics), and I'm not sure it's critical that they be preserved.  Does RM restart
do this, or are there plans to do so?

bq. About the stale tag on NMStateStore - I don't mean to put it in the NMStateStore, but I
haven't thought clearly about where to put it - maybe we can persist it on local disk directly,
or send it to the RM and retrieve it at NM registration?

I think the attempt to update the stale tag, even if it's kept separate from the
NMStateStore, will often fail in a similar way when the state store fails (e.g., full local
disk, read-only filesystem).  Therefore I don't believe the effort to maintain a stale
tag is going to be worth it.  Also, if we refuse to load a state store that's stale then we
are going to leak containers, because we won't try to recover anything from it.

Instead, I think we should decide for each store failure case whether the error should
be fatal to the operation (which may lead to it being fatal to the NM overall) or whether
recovering with stale information is a better outcome than taking the NM down.  In the latter
case we should just log the error and move on.
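
One way to picture this per-operation decision is the sketch below. It is an illustration under assumptions, not code from the YARN-1341 patch: `FailurePolicy`, `StoreOp`, and `GuardedStore` are hypothetical names, and which operations deserve which policy is exactly the question left open above.

```java
import java.io.IOException;

// Hypothetical policy: should a failed store write abort the operation?
enum FailurePolicy { FATAL, LOG_AND_CONTINUE }

// Hypothetical wrapper for a single state store write.
interface StoreOp {
    void run() throws IOException;
}

class GuardedStore {
    private int ignoredFailures = 0;

    // Runs one store write under the given policy.
    // FATAL: the IOException propagates so the caller can fail the operation.
    // LOG_AND_CONTINUE: the error is logged and recovery may later see stale data.
    void execute(String what, FailurePolicy policy, StoreOp op) throws IOException {
        try {
            op.run();
        } catch (IOException e) {
            if (policy == FailurePolicy.FATAL) {
                throw e;
            }
            ignoredFailures++;
            System.err.println("Ignoring state store failure during " + what + ": " + e.getMessage());
        }
    }

    // Number of failures that were logged and skipped.
    int ignoredFailures() {
        return ignoredFailures;
    }
}
```

Keeping the policy at the call site makes each choice explicit and reviewable, rather than hiding a single global rule inside the store.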

> Recover NMTokens upon nodemanager restart
> -----------------------------------------
>                 Key: YARN-1341
>                 URL: https://issues.apache.org/jira/browse/YARN-1341
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, YARN-1341v4-and-YARN-1987.patch,
>                      YARN-1341v5.patch, YARN-1341v6.patch

This message was sent by Atlassian JIRA
