Date: Tue, 7 Jan 2014 16:55:55 +0000 (UTC)
From: "Bikas Saha (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart

    [ https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864382#comment-13864382 ]

Bikas Saha commented on YARN-1489:
----------------------------------

The plan of record (POR) is for the attempt's AMRM register RPC to return the containers currently running for that app. So when the attempt makes its initial sync with the RM, it will get all of that information.
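
To make the register-time handoff concrete, here is a minimal toy sketch in Python. It is not YARN code: `ToyResourceManager`, `RegisterResponse`, `Container`, and every method and field name below are hypothetical stand-ins, and the only behavior modeled is the one described above (the register call hands the attempt the containers still running for its app).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Container:
    """Hypothetical stand-in for a running container record."""
    container_id: str
    node: str


@dataclass
class RegisterResponse:
    """Hypothetical register reply: carries the app's running containers."""
    attempt_id: int
    running_containers: List[Container]


@dataclass
class ToyResourceManager:
    """Toy RM that tracks running containers per application (assumption)."""
    running: Dict[str, List[Container]] = field(default_factory=dict)

    def launch(self, app_id: str, container: Container) -> None:
        # Record a container as running for the given application.
        self.running.setdefault(app_id, []).append(container)

    def register_application_master(self, app_id: str, attempt_id: int) -> RegisterResponse:
        # Containers are keyed by app, not by attempt, so a restarted
        # attempt receives whatever the previous attempt left running.
        return RegisterResponse(attempt_id, list(self.running.get(app_id, [])))


rm = ToyResourceManager()
rm.launch("app_1", Container("c_1", "node-a"))
rm.launch("app_1", Container("c_2", "node-b"))

# A second attempt registers and rebuilds its view of where work is running.
resp = rm.register_application_master("app_1", attempt_id=2)
recovered = {c.container_id: c.node for c in resp.running_containers}
```

In this sketch the new attempt needs no separate recovery call: the initial sync alone tells it which containers exist and where they run.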

> [Umbrella] Work-preserving ApplicationMaster restart
> ----------------------------------------------------
>
>                 Key: YARN-1489
>                 URL: https://issues.apache.org/jira/browse/YARN-1489
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: Work preserving AM restart.pdf
>
>
> Today if AMs go down,
>  - RM kills all the containers of that ApplicationAttempt
>  - New ApplicationAttempt doesn't know where the previous containers are running
>  - Old running containers don't know where the new AM is running.
> We need to fix this to enable work-preserving AM restart. The latter two can potentially be done at the app level, but it is good to have a common solution for all apps wherever possible.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)