Date: Fri, 16 May 2014 11:23:46 +0000 (UTC)
From: "Tsuyoshi OZAWA (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-1367) After restart NM should resync with the RM without killing containers

    [ https://issues.apache.org/jira/browse/YARN-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998626#comment-13998626 ]

Tsuyoshi OZAWA commented on YARN-1367:
--------------------------------------

I've read your code. The prototype includes the following changes:
1. Changed the NodeManager's RegisterNodeManagerRequest to send ContainerReport.
2. Added a configuration option, RM_WORK_PRESERVING_RECOVERY_ENABLED.
3. Added the cluster timestamp to the ContainerId.

I think we should focus on making the NM resync with the RM when RM_WORK_PRESERVING_RECOVERY_ENABLED is set to true.
Can you add the resync code (the ResourceManager-side code) to the patch? Also, regarding the ContainerId format, let's discuss it on YARN-2052.

> After restart NM should resync with the RM without killing containers
> ---------------------------------------------------------------------
>
>                 Key: YARN-1367
>                 URL: https://issues.apache.org/jira/browse/YARN-1367
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Bikas Saha
>            Assignee: Anubhav Dhoot
>         Attachments: YARN-1367.prototype.patch
>
>
> After RM restart, the RM sends a resync response to NMs that heartbeat to it. Upon receiving the resync response, the NM kills all containers and re-registers with the RM. The NM should be changed to not kill the containers and instead inform the RM about all currently running containers, including their allocations etc. After the re-register, the NM should send all pending container completions to the RM as usual.
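
The NM behaviour requested in the issue description can be sketched roughly as follows. This is a minimal, hypothetical Java model and not code from the attached patch or from YARN itself: the class NodeResyncSketch, its methods, and the plain string container ids are all assumptions made for illustration, and the boolean flag merely stands in for RM_WORK_PRESERVING_RECOVERY_ENABLED.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of an NM reacting to a resync response.
// None of these names come from the YARN code base or the prototype patch.
class NodeResyncSketch {
    // Stand-in for RM_WORK_PRESERVING_RECOVERY_ENABLED.
    private final boolean workPreservingRecoveryEnabled;
    // Containers currently running on this node, in launch order.
    private final Set<String> running = new LinkedHashSet<>();
    // Completions that have not yet been reported to the RM.
    private final List<String> pendingCompletions = new ArrayList<>();

    NodeResyncSketch(boolean workPreservingRecoveryEnabled) {
        this.workPreservingRecoveryEnabled = workPreservingRecoveryEnabled;
    }

    void launch(String containerId) {
        running.add(containerId);
    }

    void complete(String containerId) {
        // Only containers that were actually running produce a completion.
        if (running.remove(containerId)) {
            pendingCompletions.add(containerId);
        }
    }

    // Handles a resync response and returns the container ids that would be
    // attached to the re-registration request.
    List<String> onResync() {
        if (!workPreservingRecoveryEnabled) {
            // Old behaviour: kill everything and re-register with nothing.
            running.clear();
            return new ArrayList<>();
        }
        // Requested behaviour: keep containers alive and report them.
        return new ArrayList<>(running);
    }

    // After re-registering, pending completions are sent "as usual".
    List<String> drainPendingCompletions() {
        List<String> out = new ArrayList<>(pendingCompletions);
        pendingCompletions.clear();
        return out;
    }
}
```

In this sketch, a node with the flag enabled reports its still-running containers on resync and afterwards drains any completions that accumulated; with the flag disabled it falls back to the kill-and-re-register behaviour described above.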