Date: Wed, 15 Jan 2014 07:13:23 +0000 (UTC)
From: "Bikas Saha (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-1410) Handle client failover during 2 step client API's like app submission

[ https://issues.apache.org/jira/browse/YARN-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13871751#comment-13871751 ]

Bikas Saha commented on YARN-1410:
----------------------------------

I don't think I understood the failover policy with respect to restart. Whether the RM restarts or fails over, the client will be retrying across 2 different instances of the RM, so the semantics of the operations should be the same, i.e. the issues we are trying to identify and fix should be the same. Every problem that we have with failover also applies to restart.
Irrespective of failover, suppose the client calls submitApp() and then gets an error on the network (even though the RM has accepted the app). It then retries submitApp() and the RM says the app already exists. So this question is fundamental to the retry semantics of the operation. RM failover is an easy way to trigger this condition. Let's spend some time thinking of a solution that avoids doing a getApplication and receiving an exception before submitting the application.

> Handle client failover during 2 step client API's like app submission
> ---------------------------------------------------------------------
>
> Key: YARN-1410
> URL: https://issues.apache.org/jira/browse/YARN-1410
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Bikas Saha
> Assignee: Xuan Gong
> Attachments: YARN-1410-outline.patch, YARN-1410.1.patch, YARN-1410.2.patch, YARN-1410.2.patch, YARN-1410.3.patch
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> App submission involves
> 1) creating appId
> 2) using that appId to submit an ApplicationSubmissionContext to the user.
> The client may have obtained an appId from an RM, the RM may have failed over, and the client may submit the app to the new RM.
> Since the new RM has a different notion of cluster timestamp (used to create app id) the new RM may reject the app submission resulting in unexpected failure on the client side.
> The same may happen for other 2 step client API operations.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
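
The retry scenario described in the comment above can be reproduced with a small toy model. This is only a sketch in Python, not YARN code: `ToyRM`, `LossyChannel`, `submit_with_retry`, and the outcome strings are hypothetical names made up for illustration, and the real client and RM interfaces differ. It shows that no failover is needed for the client to see "app already exists" on a retry: a single lost response is enough.

```python
class AppAlreadyExists(Exception):
    """Raised by the toy RM when an app id has already been accepted."""


class ToyRM:
    """Hypothetical stand-in for a ResourceManager; not the YARN API."""

    def __init__(self):
        self.apps = {}  # app_id -> submission context

    def submit_app(self, app_id, context):
        if app_id in self.apps:
            raise AppAlreadyExists(app_id)
        self.apps[app_id] = context


class LossyChannel:
    """Delivers every request to the RM but drops the first `drops` responses."""

    def __init__(self, rm, drops=1):
        self.rm = rm
        self.drops = drops

    def submit_app(self, app_id, context):
        self.rm.submit_app(app_id, context)  # the RM accepts the app...
        if self.drops > 0:
            self.drops -= 1
            # ...but the client never hears back.
            raise ConnectionError("response lost")


def submit_with_retry(channel, app_id, context, attempts=3):
    """Retry on network errors and report the outcome the client observes."""
    retried = False
    for _ in range(attempts):
        try:
            channel.submit_app(app_id, context)
            return "accepted"
        except ConnectionError:
            retried = True
        except AppAlreadyExists:
            # Ambiguous for the client: an earlier attempt of its own may
            # have succeeded, or the id may genuinely be a duplicate.
            return "already-exists-after-retry" if retried else "already-exists"
    return "gave-up"


rm = ToyRM()
outcome = submit_with_retry(LossyChannel(rm, drops=1), "app_1", {"queue": "default"})
print(outcome)
```

In this model the RM holds `app_1` after the first attempt, yet the client only learns about it through the exception on the second attempt. How the real client should interpret that exception is the open question raised in the comment.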