Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: yarn-issues@hadoop.apache.org
Date: Tue, 14 Jul 2015 11:25:05 +0000 (UTC)
From: "Peng Zhang (JIRA)" <jira@apache.org>
To: yarn-issues@hadoop.apache.org
Message-ID: <JIRA.12822913.1429682519000.181009.1436873105836@Atlassian.JIRA>
In-Reply-To: <JIRA.12822913.1429682519000@Atlassian.JIRA>
References: <JIRA.12822913.1429682519000@Atlassian.JIRA>
 <JIRA.12822913.1429682519085@arcas>
Subject: [jira] [Commented] (YARN-3535)  ResourceRequest should be restored
 back to scheduler when RMContainer is killed at ALLOCATED
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626207#comment-14626207 ] 

Peng Zhang commented on YARN-3535:
----------------------------------

[~rohithsharma]

Thanks for rebasing and adding tests.

As for removing {{recoverResourceRequestForContainer}}: according to my notes, it caused the test {{CapacityScheduler#testRecoverRequestAfterPreemption}} to fail.
But I cannot remember the reasoning behind my earlier note:
bq. Remove the call to recoverResourceRequestForContainer from preemption to avoid recovering the RR twice.

I applied my patch {{YARN-3535-002.patch}} to our production cluster, and preemption works well with FairScheduler.

I have seen the failure of {{TestAMRestart.testAMRestartWithExistingContainers}} before, and I think the cause is the one I described earlier:
bq. TestAMRestart.java is changed because the case testAMRestartWithExistingContainers triggers this logic. After this patch, one more container may be scheduled, so attempt.getJustFinishedContainers().size() may be bigger than expectedNum and the loop never ends. So I simply changed the test to handle this situation.
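
To illustrate the kind of loop problem described above, here is a minimal, self-contained sketch. It is not the actual TestAMRestart code: the class {{WaitLoopSketch}}, its methods, and the {{maxPolls}} bound are all hypothetical, and it assumes the wait loop compares the finished-container count to expectedNum with exact equality. Under that assumption, a count that overshoots expectedNum never satisfies the condition, while an "at least" comparison does.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch, not the real TestAMRestart code.
public class WaitLoopSketch {

    // Polls until the count equals expected exactly.
    // maxPolls bounds the loop so the sketch terminates; an unbounded
    // version of this loop would spin forever once the count overshoots.
    public static boolean waitUntilEquals(IntSupplier count, int expected, int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            if (count.getAsInt() == expected) {
                return true;
            }
        }
        return false;
    }

    // Polls until the count reaches or exceeds expected.
    public static boolean waitUntilAtLeast(IntSupplier count, int expected, int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            if (count.getAsInt() >= expected) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int expectedNum = 2;
        // Assumed scenario: one extra container was scheduled and finished,
        // so the observed count is expectedNum + 1.
        IntSupplier finished = () -> expectedNum + 1;

        System.out.println(waitUntilEquals(finished, expectedNum, 1000));  // prints false
        System.out.println(waitUntilAtLeast(finished, expectedNum, 1000)); // prints true
    }
}
```

Whether the real test should accept extra finished containers or avoid scheduling them in the first place is a separate design question; the sketch only shows why an exact-equality wait can hang.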


>  ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
> ---------------------------------------------------------------------------------------------
>
>                 Key: YARN-3535
>                 URL: https://issues.apache.org/jira/browse/YARN-3535
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Peng Zhang
>            Assignee: Peng Zhang
>            Priority: Critical
>         Attachments: 0003-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log
>
>
> During a rolling update of the NM, the AM failed to start a container on the NM.
> The job then hung.
> AM logs are attached.

