Date: Sat, 31 May 2014 05:11:02 +0000 (UTC)
From: "Jian He (JIRA)"
To: yarn-issues@hadoop.apache.org
Reply-To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-1367) After restart NM should resync with the RM without killing containers

    [ https://issues.apache.org/jira/browse/YARN-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014529#comment-14014529 ]

Jian He commented on YARN-1367:
-------------------------------

Thanks for working on the patch. The patch needs to be updated; could you please update it? A few initial comments:
- Let's leave containerId to be handled separately in YARN-2052.
- The extra ContainerReport in RegisterNodeManagerRequest is no longer needed.
- The NM side may not need its own config for enabling work-preserving restart. Since the RM already has this config, the RM should be able to instruct the NM to keep_containers_on_resync in the case of a work-preserving restart and kill_containers_on_resync in the case of a non-work-preserving restart. Doing so also avoids config overhead on each NM.

> After restart NM should resync with the RM without killing containers
> ---------------------------------------------------------------------
>
>                 Key: YARN-1367
>                 URL: https://issues.apache.org/jira/browse/YARN-1367
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Bikas Saha
>            Assignee: Anubhav Dhoot
>         Attachments: YARN-1367.prototype.patch
>
>
> After RM restart, the RM sends a resync response to NMs that heartbeat to it. Upon receiving the resync response, the NM kills all containers and re-registers with the RM. The NM should be changed to not kill the containers and instead inform the RM about all currently running containers, including their allocations etc. After the re-register, the NM should send all pending container completions to the RM as usual.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
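
For illustration only, the last review comment (the RM decides, and the NM simply follows the instruction it receives) could be sketched roughly as below. This is a minimal sketch under stated assumptions, not the YARN-1367 patch: the class `ResyncSketch`, the enum `ResyncAction`, and both method names are hypothetical, and only the two action names come from the comment itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these types are real YARN classes or protocol fields.
final class ResyncSketch {

    // Instruction the RM would attach to its resync response (names taken from the comment).
    enum ResyncAction { KEEP_CONTAINERS_ON_RESYNC, KILL_CONTAINERS_ON_RESYNC }

    // RM side: derive the instruction from the RM's own work-preserving-restart setting,
    // so that no per-NM configuration is needed.
    static ResyncAction chooseAction(boolean workPreservingRestartEnabled) {
        return workPreservingRestartEnabled
                ? ResyncAction.KEEP_CONTAINERS_ON_RESYNC
                : ResyncAction.KILL_CONTAINERS_ON_RESYNC;
    }

    // NM side: decide which running containers to report when re-registering.
    // With KILL, the containers are assumed to be killed first, so none are reported as running.
    static List<String> containersToReport(ResyncAction action, List<String> running) {
        if (action == ResyncAction.KILL_CONTAINERS_ON_RESYNC) {
            return new ArrayList<>();
        }
        return new ArrayList<>(running);
    }

    public static void main(String[] args) {
        List<String> running = Arrays.asList("container_1", "container_2");
        for (boolean workPreserving : new boolean[] {true, false}) {
            ResyncAction action = chooseAction(workPreserving);
            System.out.println(action + " -> report " + containersToReport(action, running));
        }
    }
}
```

The point of the sketch is that the NM-side method takes the action as an input rather than reading a local setting, which is what removes the need for a matching config on every NM.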