hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1354) Recover applications upon nodemanager restart
Date Tue, 29 Jul 2014 00:15:39 GMT

    [ https://issues.apache.org/jira/browse/YARN-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077178#comment-14077178 ]

Jason Lowe commented on YARN-1354:

Thanks for taking a look, Junping!

bq. what would happen if storeApplication(), finishApplication(), removeApplication() failed
with application related information get inconsistent after restart?

If storeApplication fails then it will throw an IOException which will bubble up and fail
the container start request on the client.  As long as we're unable to store a new application,
containers for that application will not start, which I believe is the desired behavior. 
That prevents the state store from being inconsistent in this particular scenario.
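
A minimal sketch of that failure path, assuming a simplified store interface. AppStateStore, ContainerStartSketch, and startContainer are hypothetical names for illustration, not the classes in the patch:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the NM state store; not the real Hadoop interface.
interface AppStateStore {
    void storeApplication(String appId, byte[] record) throws IOException;
}

class ContainerStartSketch {
    private final AppStateStore store;
    // Applications the NM is tracking in memory.
    final Map<String, byte[]> activeApps = new HashMap<>();

    ContainerStartSketch(AppStateStore store) {
        this.store = store;
    }

    // Persist a new application before tracking it. If the store fails, the
    // IOException propagates to the caller, so the container start request
    // fails and nothing is tracked that the store does not know about.
    void startContainer(String appId, String containerId) throws IOException {
        if (!activeApps.containsKey(appId)) {
            byte[] record = appId.getBytes(StandardCharsets.UTF_8);
            store.storeApplication(appId, record); // may throw
            activeApps.put(appId, record);
        }
        // ... launching containerId would happen here ...
    }
}
```

Because the in-memory map is only updated after the store call returns, a failed store leaves both views without the application.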

If finishApplication fails then the NM will proceed as if it did succeed but the state store
will still have the application present.  This should be corrected when the NM restarts and
registers with the RM with those applications still running.  The RM should correct the situation
by telling the NM that the application has finished (see YARN-1885), and the NM will proceed
to perform application finish processing (e.g., log aggregation).  I think worst-case
it will upload all of the app container logs again, but when it goes to rename to the final
destination name that will fail because the name already exists.  Thus there could be some
wasted work, but it should sort itself out and not do something catastrophic.
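
The sequence above can be sketched roughly as follows. This is an illustration under simplifying assumptions: FinishReconcileSketch and its members are hypothetical, the RM exchange is reduced to a set of finished application IDs, and a counter stands in for log aggregation:

```java
import java.io.IOException;
import java.util.Set;
import java.util.TreeSet;

class FinishReconcileSketch {
    // Hypothetical persistent state: apps the store still records as running.
    final Set<String> storedRunning = new TreeSet<>();
    boolean storeBroken = false;
    int logUploads = 0; // stands in for log aggregation work

    void finishApplicationInStore(String appId) throws IOException {
        if (storeBroken) {
            throw new IOException("state store unavailable");
        }
        storedRunning.remove(appId);
    }

    // A store failure here is logged and otherwise ignored, so the NM
    // proceeds as if the finish had been recorded.
    void onApplicationFinished(String appId) {
        try {
            finishApplicationInStore(appId);
        } catch (IOException e) {
            System.err.println("Unable to record finish of " + appId + ": " + e.getMessage());
        }
        logUploads++;
    }

    // After a restart the NM reports whatever the store still lists as
    // running; the RM's reply (reduced here to a set of finished app IDs)
    // triggers finish processing again for the stale entries.
    Set<String> restartAndRegister(Set<String> finishedAccordingToRm) {
        Set<String> reported = new TreeSet<>(storedRunning);
        for (String appId : reported) {
            if (finishedAccordingToRm.contains(appId)) {
                onApplicationFinished(appId);
            }
        }
        return reported;
    }
}
```

The second pass repeats the upload work, which corresponds to the wasted effort mentioned above, but the store ends up consistent once a finish is recorded successfully.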

If removeApplication fails then the NM will proceed as if it did succeed but the state store
will still have the application present.  This should be corrected when the NM finishes application
processing (per above or if it was already recorded as finished) and it will again try to
remove it from the state store.  As above I think there could be some unnecessary work performed,
but the application should eventually be removed from the NM on restart.  It could still
remain in the state store if the second removal also fails, but a subsequent restart should
behave the same.

bq. Do we need special warning if get failed on deserializing credential here?

I'm not sure how credential processing is fundamentally all that different from protocol buffer
parsing, which could also fail.  If the credentials can't be read then we can't recover the
application.  Currently recovery errors are fatal to NM startup.  Do you have something specific
in mind for handling the credentials if the writable changes (e.g., some pseudocode to show
the approach)?
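
For reference, the current fail-fast behavior amounts to something like the sketch below. The record layout (a UTF user name followed by an int token count) and the names RecoverySketch, parseRecord, and recoverApplications are assumptions made up for illustration; they are not the real credential or protocol buffer formats:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class RecoverySketch {
    // Hypothetical record layout: UTF user name followed by an int token count.
    static String parseRecord(byte[] bytes) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            String user = in.readUTF();
            int tokens = in.readInt();
            return user + ":" + tokens;
        }
    }

    // Any record that cannot be parsed aborts recovery as a whole, which is
    // what makes a recovery error fatal to startup in this sketch.
    static List<String> recoverApplications(Map<String, byte[]> stored) throws IOException {
        List<String> recovered = new ArrayList<>();
        for (Map.Entry<String, byte[]> e : stored.entrySet()) {
            try {
                recovered.add(e.getKey() + "=" + parseRecord(e.getValue()));
            } catch (IOException ex) {
                throw new IOException("Unable to recover " + e.getKey(), ex);
            }
        }
        return recovered;
    }
}
```

A more lenient policy would catch the exception per application and skip it, at the cost of silently dropping state for that application.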

> Recover applications upon nodemanager restart
> ---------------------------------------------
>                 Key: YARN-1354
>                 URL: https://issues.apache.org/jira/browse/YARN-1354
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: YARN-1354-v1.patch, YARN-1354-v2-and-YARN-1987-and-YARN-1362.patch, YARN-1354-v3.patch, YARN-1354-v4.patch, YARN-1354-v5.patch
>
> The set of active applications in the nodemanager context needs to be recovered for work-preserving nodemanager restart.

