hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Xuan Gong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1410) Handle client failover during 2 step client API's like app submission
Date Fri, 10 Jan 2014 07:38:54 GMT

    [ https://issues.apache.org/jira/browse/YARN-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13867591#comment-13867591
] 

Xuan Gong commented on YARN-1410:
---------------------------------

Posting a patch that outlines the approach. Thanks to the AtMostOnce annotation and idempotent
annotation, we only need to do a little code changes.
* As my previous comment, we will make RM accept the appId in the context. When the failover
happens, we will re-use the old applicationId (assigned by previous active RM) to submit the
applications for current active RM
* Use AtMostOnce and idempotent annotation
** As we discussed in YARN-1521, submitApplication and getNewApplication can not be idempotent.
To make those functions retry, we can add AtMostOnce annotation. And getApplicationReport
can be marked as idempotent. 
**  I would like to add related annotations for these three apis here, because I think that
this is part of work for this ticket
* This is how application submission works. YarnClient#SubmitApplication +call+ ClientRMService#SubmitApplication
+call+ RMAppManager#SubmitApplication +create+ RMApp and submit the START Event +ReturnBack+
ClientRMService#SubmitApplication +ReturnBack+ YarnClient#SubmitApplication +CheckingAppStatus+
getApplicationReport +END+
** The failover may happen in any steps or between any steps.
** If failover happens :
*** between the time that YarnClient#SubmitApplication starts and the time that ClientRMService#SubmitApplication
is called. The YarnClient will find the next active RM, and continue to do where it left.
*** between the time that ClientRMService#SubmitApplication starts and the time that RMAppManager#SubmitApplication
is called. We will restart ClientRMService#SubmitApplication(re-run it from the first line).
At this time, the application has not been saved in zookeeper yet, so we are fine to restart
ClientRMService#SubmitApplication.
*** between the time that RMAppManager#SubmitApplication starts and the time that RMApp has
been created and START EVENT has been submitted. We will do the same thing as previous case.
*** after the time that RMApp has been created and START Event has been send out. If the failover
happens, there are several different cases:
**** after YarnClient got the SubmitApplicationResponse, but state of RMApp has not been saved
in zookeeper yet. If the failover happens, when we try to getApplicationReport, we will get
ApplicationNotFoundException. What I am doing here is to catch this exception, and call YarnClient#SubmitApplication
again.
**** after YarnClient got the SubmitApplicationResponse, and the state of RMApp has been saved
in zookeeper. If the failover happens, we do not need to do anything. 
**** before YarnClient got the SubmitApplicationResponse, and state of RMApp has not been
saved in zookeeper yet. If the failover happens, we will restart ClientRMService#SubmitApplication
at very beginning
**** before YarnClient got the SubmitApplicationResponse, but state of RMApp has been saved
in zookeeper. This is the most tricky case. If the failover happens here, we will re-run ClientRMService#SubmitApplication
at very beginning. It will try to re-submit the application with the old applicationId. But
since we have already saved this application in zookeeper, we will get a "Application with
id already exists" exception which is *not* we want.


For the last corner case, [~bikassaha], [~kkambatl] Any suggestions ?


> Handle client failover during 2 step client API's like app submission
> ---------------------------------------------------------------------
>
>                 Key: YARN-1410
>                 URL: https://issues.apache.org/jira/browse/YARN-1410
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Xuan Gong
>         Attachments: YARN-1410-outline.patch, YARN-1410.1.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> App submission involves
> 1) creating appId
> 2) using that appId to submit an ApplicationSubmissionContext to the user.
> The client may have obtained an appId from an RM, the RM may have failed over, and the
client may submit the app to the new RM.
> Since the new RM has a different notion of cluster timestamp (used to create app id)
the new RM may reject the app submission resulting in unexpected failure on the client side.
> The same may happen for other 2 step client API operations.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Mime
View raw message