Date: Tue, 14 Jul 2015 11:25:05 +0000 (UTC)
From: "Peng Zhang (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED

    [ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626207#comment-14626207 ]

Peng Zhang commented on YARN-3535:
----------------------------------

[~rohithsharma] Thanks for rebasing and adding tests.

As for removing {{recoverResourceRequestForContainer}}: according to my notes, it caused the test {{CapacityScheduler#testRecoverRequestAfterPreemption}} to fail. However, I can no longer recall the reasoning behind my earlier note:
bq. Remove call of recoverResourceRequestForContainer from preemption to avoid duplication of recover RR.
I applied my patch {{YARN-3535-002.patch}} on our production cluster, and preemption works well with FairScheduler.

As for the failure of {{TestAMRestart.testAMRestartWithExistingContainers}}, I have run into it before, and I think the cause is the one I described earlier:
bq. Changing TestAMRestart.java is because that case testAMRestartWithExistingContainers will trigger this logic. After this patch, one more container may be scheduled, and attempt.getJustFinishedContainers().size() may be bigger than expectedNum and loop never ends. So I simply change the situation.

> ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
> ---------------------------------------------------------------------------------------------
>
>                 Key: YARN-3535
>                 URL: https://issues.apache.org/jira/browse/YARN-3535
>             Project: Hadoop YARN
>          Issue Type: Bug
>   Affects Versions: 2.6.0
>            Reporter: Peng Zhang
>            Assignee: Peng Zhang
>            Priority: Critical
>         Attachments: 0003-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log
>
>
> During rolling update of NM, AM start of container on NM failed.
> And then job hang there.
> Attach AM logs.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)