hadoop-yarn-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
Date Thu, 19 Jun 2014 20:55:24 GMT

    [ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037868#comment-14037868 ]

Junping Du commented on YARN-1341:

bq.  Restarts should be rare, and I'd rather not force a loss of work by taking the NM down
instantly when the state store hiccups.
Yes. But in the rolling upgrade case, restarts should be much more frequent than state store
failures (correct me if I am wrong, as I am not a LevelDB expert). In that case we should
always expect some work loss: even if we don't bring the NM down now, we will suffer after
the NM restarts during the upgrade.

bq.  If the state store is missing some things, we might not be able to recover a localized
resource, a token, a container, or possibly anything at all.
I am not worried about losing them all, but if we can only partially recover these, would that
become a problem and break some assumptions we have? I don't know. But it seems to make things
more complicated.

bq.  in the worst-case, the state store is so corrupted on startup that we don't even survive
the NM restart and the NM crashes, which would have an end result just like if we took it
down when the state store failed.
I am not sure this is the worst case. The worst case, it seems to me, is that the NM restarts
with only partial state recovered and the running containers are not aware of this inconsistent
state, which could cause some weird bugs. I am not sure how likely this is here, please 

bq.  Therefore I'd rather not guarantee that we'll lose work by crashing the NM on any store
error and instead try to preserve the work we have. The NM could theoretically recover (e.g.:
if the error is transient then the next RM key store could succeed). If we take the NM down
immediately then we're guaranteeing the work is lost. Is that really better?
I think it is better to guarantee that the work is lost, because then the user's expectation
is consistent. We don't know when a new token from the RM will arrive to replace the stale one,
so preserving the work would only succeed by luck. Users shouldn't expect work to be preserved
after an NM restart if the state store failed at some point.

bq. May be a better approach is to have errors like this trigger an unhealthy state for the
NM when we have the ability to do a graceful decommission. 
I agree. This could be a better approach.

Overall, I agree that we can just log the error here without bringing the NM down (otherwise
we would have to change the previous code that updates localizedResources/deletionServices),
for the reasons you specified above. However, to avoid loading inconsistent state and to manage
users' expectations, I think we shouldn't allow the state to be loaded again if a store failure
happened earlier. Maybe we could add a stale tag to the NMStateStore, set it when a store
failure happens, and never load a stale store. [~jlowe], what do you think?
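
To make the stale-tag idea concrete, here is a rough sketch of how it might look. Everything in
it is hypothetical and for illustration only: StaleAwareStateStore, the Backend interface, and
the marker file are made-up names, not the actual NMStateStoreService API or anything in the
attached patches. It assumes the stale tag is kept as a separate marker file, so that it
survives the NM restart even when the store itself cannot be written.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical wrapper; NOT the real NM state store API. */
class StaleAwareStateStore {

    /** Minimal stand-in for the underlying key-value store (assumption). */
    interface Backend {
        void put(String key, String value) throws IOException;
        Map<String, String> loadAll() throws IOException;
    }

    private final Backend backend;
    private final Path staleMarker; // assumed to live outside the store itself

    StaleAwareStateStore(Backend backend, Path staleMarker) {
        this.backend = backend;
        this.staleMarker = staleMarker;
    }

    /** Log a store failure and tag the store as stale instead of taking the NM down. */
    synchronized void store(String key, String value) {
        try {
            backend.put(key, value);
        } catch (IOException e) {
            System.err.println("State store write failed for " + key + ": " + e);
            markStale();
        }
    }

    private void markStale() {
        try {
            // An empty file is enough; its existence is the tag.
            Files.write(staleMarker, new byte[0]);
        } catch (IOException e) {
            // If even the marker cannot be written, this sketch can only log it.
            System.err.println("Could not persist stale marker: " + e);
        }
    }

    synchronized boolean isStale() {
        return Files.exists(staleMarker);
    }

    /** On restart, never load a store that was tagged stale; start with empty state. */
    synchronized Map<String, String> recover() throws IOException {
        if (isStale()) {
            return Collections.emptyMap();
        }
        return backend.loadAll();
    }
}
```

With something like this, a failed store() keeps the NM (and its running work) alive for now,
but the next restart recovers nothing rather than a partial state. Whether an empty recovery is
the right behavior for a stale store, and where the tag should really live, are open questions.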

> Recover NMTokens upon nodemanager restart
> -----------------------------------------
>                 Key: YARN-1341
>                 URL: https://issues.apache.org/jira/browse/YARN-1341
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, YARN-1341v4-and-YARN-1987.patch,
>                      YARN-1341v5.patch, YARN-1341v6.patch

This message was sent by Atlassian JIRA
