hadoop-yarn-issues mailing list archives

From "Rohith Sharma K S (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4665) Asynch submit can lose application submissions
Date Tue, 02 Feb 2016 17:05:39 GMT

    [ https://issues.apache.org/jira/browse/YARN-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128569#comment-15128569 ]

Rohith Sharma K S commented on YARN-4665:

The scenario is valid, but it is already handled in YarnClient. Is that not sufficient?
In method YarnClientImpl#submitApplication():
      } catch (ApplicationNotFoundException ex) {
        // FailOver or RM restart happens before RMStateStore saves
        // ApplicationState
        LOG.info("Re-submit application " + applicationId + "with the " +
            "same ApplicationSubmissionContext");
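
To make the client-side handling concrete, here is a minimal, self-contained sketch of the resubmit-on-not-found pattern that the snippet above points at. It does not use the Hadoop API: FakeResourceManager, AppNotFoundException, ResubmittingClient and loseNextSubmission() are hypothetical stand-ins invented for illustration, and the real YarnClientImpl logic may differ in its details.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for ApplicationNotFoundException; not the Hadoop class.
class AppNotFoundException extends Exception {
    AppNotFoundException(String appId) { super("Unknown application " + appId); }
}

// Hypothetical RM that holds accepted applications in memory only.
class FakeResourceManager {
    private final Set<String> accepted = new HashSet<>();
    private boolean loseNext = false;
    int submitCount = 0;

    // Arrange for the next submission to be lost, as if the RM failed over
    // before the state store had saved the application.
    void loseNextSubmission() { loseNext = true; }

    void submit(String appId) {
        submitCount++;
        accepted.add(appId);
        if (loseNext) {
            accepted.clear();   // simulated failover: unsaved state is gone
            loseNext = false;
        }
    }

    String getState(String appId) throws AppNotFoundException {
        if (!accepted.contains(appId)) {
            throw new AppNotFoundException(appId);
        }
        return "ACCEPTED";
    }
}

class ResubmittingClient {
    // Submit, then poll; if the RM no longer knows the application,
    // submit it again with the same id.
    static String submitAndWait(FakeResourceManager rm, String appId, int maxPolls) {
        rm.submit(appId);
        for (int i = 0; i < maxPolls; i++) {
            try {
                return rm.getState(appId);
            } catch (AppNotFoundException ex) {
                rm.submit(appId);   // re-submit after failover or restart
            }
        }
        throw new IllegalStateException("Application " + appId + " never became visible");
    }
}
```

In this sketch the resubmit only helps while the client is still polling inside submitAndWait(); a submission lost after the call has returned would not be recovered by it.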

> Asynch submit can lose application submissions
> ----------------------------------------------
>                 Key: YARN-4665
>                 URL: https://issues.apache.org/jira/browse/YARN-4665
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.1.0-beta
>            Reporter: Daniel Templeton
>            Assignee: Daniel Templeton
>            Priority: Critical
> The change introduced in YARN-514 opens up a hole into which applications can fall and
> be lost.  Prior to YARN-514, the {{submitApplication()}} call did not complete until the application
> state was persisted to the state store.  After YARN-514, the {{submitApplication()}} call
> is asynchronous, with the application state being saved later.
> If the state store is slow or unresponsive, an application's state may
> not be persisted for quite a while.  During that time, if the RM fails (over), all applications
> that have not yet been persisted to the state store will be lost without the client being
> This issue is inherent in the design of YARN-514.  I see three solutions:
> 1. Add a WAL to the state store. HBase does it, so we know how to do it. It seems like
> a heavy solution to the original problem, however.  It's certainly not a trivial change.
> 2. Revert YARN-514 and update the RPC layer to allow a connection to be parked if it's
> doing something that may take a while. This is a generally useful feature but could be a deep
> rabbit hole.
> 3. Revert YARN-514 and add back-pressure to the job submission. For example, we set a
> maximum number of threads that can simultaneously be assigned to handle job submissions.
> When that threshold is reached, new job submissions get a try-again-later response. This is
> also a generally useful feature and should be a fairly constrained set of changes.  The downside
> is that it impacts the API.
> I think the third option is the most approachable.  It's the smallest change, and it
> adds useful behavior beyond solving the original issue.  And I don't think the API impact
> is significant.
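
For readers who want a picture of the third option, here is a rough sketch of back-pressure on synchronous submission. It assumes a simple permit count in place of a handler-thread limit; SubmissionGate, Result and the persistState callback are hypothetical names, not part of YARN or of any proposed patch.

```java
import java.util.concurrent.Semaphore;

// Hypothetical gate limiting how many submissions may be in flight at once.
class SubmissionGate {
    enum Result { ACCEPTED, TRY_AGAIN_LATER }

    private final Semaphore permits;

    SubmissionGate(int maxConcurrentSubmissions) {
        this.permits = new Semaphore(maxConcurrentSubmissions);
    }

    // Runs persistState synchronously if a permit is free; otherwise the
    // caller is told to retry instead of being queued.
    Result submit(Runnable persistState) {
        if (!permits.tryAcquire()) {
            return Result.TRY_AGAIN_LATER;
        }
        try {
            persistState.run();   // returns only after the state is stored
            return Result.ACCEPTED;
        } finally {
            permits.release();
        }
    }
}
```

Because the call does not return ACCEPTED until persistState has finished, a slow state store makes callers wait or retry rather than leaving accepted applications unpersisted. The new TRY_AGAIN_LATER outcome is the kind of API impact the description mentions.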

This message was sent by Atlassian JIRA