hadoop-yarn-issues mailing list archives

From "Daniel Templeton (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-4665) Asynch submit can lose application submissions
Date Tue, 02 Feb 2016 17:29:40 GMT

     [ https://issues.apache.org/jira/browse/YARN-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Templeton updated YARN-4665:
-----------------------------------
       Priority: Major  (was: Critical)
    Description: 
The change introduced in YARN-514 opens up a hole into which applications can fall and be
lost.  Prior to YARN-514, the {{submitApplication()}} call did not complete until the application
state was persisted to the state store.  After YARN-514, the {{submitApplication()}} call
is asynchronous, with the application state being saved later.

If the state store is slow or unresponsive, an application's state may not be persisted for
quite a while.  During that time, if the RM fails or fails over, all applications that have not
yet been persisted to the state store will be lost.  If the active RM loses ZK connectivity, a
significant number of job submissions can pile up before the ZK connection times out, resulting
in a large number of client failures when it finally does.
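
To make the loss window concrete, here is a toy single-threaded model of it. It is only an
illustration: {{AsyncSubmitModel}} and all of its methods are hypothetical names, not RM code,
and the real RM hands state-store writes to other threads rather than to an explicit
{{persistOne()}} call.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy model of asynchronous submission; every name here is hypothetical. */
class AsyncSubmitModel {
    // Applications acknowledged to the client but held only in RM memory.
    private final Queue<String> pending = new ArrayDeque<>();
    // Applications written to the state store; these survive a failover.
    private final List<String> persisted = new ArrayList<>();

    /** Asynchronous submit: returns before anything has been persisted. */
    void submit(String applicationId) {
        pending.add(applicationId);
    }

    /** The state store catches up on one pending application, if there is one. */
    void persistOne() {
        String applicationId = pending.poll();
        if (applicationId != null) {
            persisted.add(applicationId);
        }
    }

    /** RM failure: in-memory state is discarded. Returns the lost application IDs. */
    List<String> failOver() {
        List<String> lost = new ArrayList<>(pending);
        pending.clear();
        return lost;
    }

    /** What a newly active RM could recover from the state store. */
    List<String> recovered() {
        return new ArrayList<>(persisted);
    }
}
```

In this model, every submission still in {{pending}} when {{failOver()}} is called is gone. The
slower the state store, the more submissions sit in that queue.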

This issue is inherent in the design of YARN-514.  I see three solutions:

1. Add a WAL to the state store. HBase does it, so we know how to do it. It seems like a heavy
solution to the original problem, however.  It's certainly not a trivial change.

2. Revert YARN-514 and update the RPC layer to allow a connection to be parked if it's doing
something that may take a while. This is a generally useful feature but could be a deep rabbit
hole.

3. Revert YARN-514 and add back-pressure to the job submission. For example, we set a maximum
number of threads that can simultaneously be assigned to handle job submissions.  When that
threshold is reached, new job submissions get a try-again-later response. This is also a generally
useful feature and should be a fairly constrained set of changes.

I think the third option is the most approachable.  It's the smallest change, and it adds
useful behavior beyond solving the original issue.
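
A minimal sketch of what the third option could look like, assuming a counting semaphore is
used to cap concurrent submissions. {{SubmissionThrottle}}, {{StateStore}}, and {{Result}} are
hypothetical names invented for this illustration; they are not existing YARN classes, and the
real change would live in the RM's submission path and RPC response handling.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical sketch of back-pressure on submission; not an existing YARN class. */
class SubmissionThrottle {
    /** Outcome reported to the client for one submission attempt. */
    enum Result { ACCEPTED, TRY_AGAIN_LATER }

    /** Stand-in for the blocking write to the RM state store. */
    interface StateStore {
        void persist(String applicationId);
    }

    private final Semaphore permits;
    private final StateStore store;

    SubmissionThrottle(int maxConcurrentSubmissions, StateStore store) {
        this.permits = new Semaphore(maxConcurrentSubmissions);
        this.store = store;
    }

    /**
     * Synchronous submit: ACCEPTED is returned only after the state is persisted.
     * If every permit is already in use, the call is rejected immediately
     * instead of queueing behind a slow state store.
     */
    Result submit(String applicationId) {
        if (!permits.tryAcquire()) {
            return Result.TRY_AGAIN_LATER;
        }
        try {
            store.persist(applicationId);
            return Result.ACCEPTED;
        } finally {
            // Release the permit even if the state store write throws.
            permits.release();
        }
    }
}
```

With this shape, a slow state store ties up at most {{maxConcurrentSubmissions}} handler
threads, and no submission is acknowledged before it has been persisted.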

  was:
The change introduced in YARN-514 opens up a hole into which applications can fall and be
lost.  Prior to YARN-514, the {{submitApplication()}} call did not complete until the application
state was persisted to the state store.  After YARN-514, the {{submitApplication()}} call
is asynchronous, with the application state being saved later.

If the state store is slow or unresponsive, it may be that an application's state may not
be persisted for quite a while.  During that time, if the RM fails (over), all applications
that have not yet been persisted to the state store will be lost without the client being
aware.

This issue is inherent in the design of YARN-514.  I see three solutions:

1. Add a WAL to the state store. HBase does it, so we know how to do it. It seems like a heavy
solution to the original problem, however.  It's certainly not a trivial change.

2. Revert YARN-514 and update the RPC layer to allow a connection to be parked if it's doing
something that may take a while. This is a generally useful feature but could be a deep rabbit
hole.

3. Revert YARN-514 and add back-pressure to the job submission. For example, we set a maximum
number of threads that can simultaneously be assigned to handle job submissions.  When that
threshold is reached, new job submissions get a try-again-later response. This is also a generally
useful feature and should be a fairly constrained set of changes.  The downside is that it
impacts the API.

I think the third option is the most approachable.  It's the smallest change, and it adds
useful behavior beyond solving the original issue.  And I don't think the API impact is significant.


[~rohithsharma], I've updated the description to be more specific about the issue.  Yes, the
client will keep polling, and it fails cleanly when the RM fails, but the RM failure is not
always clean, leaving the potential for huge numbers of client failures.

> Asynch submit can lose application submissions
> ----------------------------------------------
>
>                 Key: YARN-4665
>                 URL: https://issues.apache.org/jira/browse/YARN-4665
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.1.0-beta
>            Reporter: Daniel Templeton
>            Assignee: Daniel Templeton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
