hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Bikas Saha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1410) Handle client failover during 2 step client API's like app submission
Date Mon, 24 Feb 2014 22:30:23 GMT

    [ https://issues.apache.org/jira/browse/YARN-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910898#comment-13910898
] 

Bikas Saha commented on YARN-1410:
----------------------------------

There is considerable confusion here. I havent seen the latest code but here is my understanding
of app submission in yarn.
1) client call submitApp(). this submits the app context and returns success or failure after
initial static checks.
2) if success is returned then client call getAppReport() and waits for the app to be accepted.
If the app gets accepted, then client reports success to use that app has been successfully
submitted. Else app submission fails.

Now there can be retries in step 1) or step 2). Step 2 is idempotent. We dont need to worry
about that. Step 1) is non-idempotent. With the retry cache approach, upon retry (directly
to the same RM or to a failed over RM), a correctly working RetryCache will return the same
response as was originally sent by the RM. So if the RM returned success, RetryCache will
return success. If the RM returned immediate failure (based on static checks) then the RetryCache
will return failure. Its not clear to me why this would cause issues or why it wont work in
YARN.

The RetryCache is used for per RPC retries. It is not related to the 2-step process that we
use in YARN where each step is a different RPC request. Final success for the user is based
on the completion of both steps. RetryCache can be used to return the same RPC response for
Step 1 as many times as the client retries that same RPC request. Thats exactly what we want.
The crucial piece is storing whats needed to re-populate the RetryCache upon failover. Here,
we are piggy-backing on AppSubmissionContext storage just like HDFS piggybacks on the edit
log entry.

I hope this make things clear. [~sureshms] Does this make sense?

Side Note: 
RetryCache also has an option to store a payload along with the response. This is useful when
the response has a large internal object that is hard/expensive to re-create and can be fetched
from the RetryCache directly.


> Handle client failover during 2 step client API's like app submission
> ---------------------------------------------------------------------
>
>                 Key: YARN-1410
>                 URL: https://issues.apache.org/jira/browse/YARN-1410
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Xuan Gong
>         Attachments: YARN-1410-outline.patch, YARN-1410.1.patch, YARN-1410.2.patch, YARN-1410.2.patch,
YARN-1410.3.patch, YARN-1410.4.patch, YARN-1410.5.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> App submission involves
> 1) creating appId
> 2) using that appId to submit an ApplicationSubmissionContext to the user.
> The client may have obtained an appId from an RM, the RM may have failed over, and the
client may submit the app to the new RM.
> Since the new RM has a different notion of cluster timestamp (used to create app id)
the new RM may reject the app submission resulting in unexpected failure on the client side.
> The same may happen for other 2 step client API operations.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Mime
View raw message