hadoop-yarn-issues mailing list archives

From "Peng Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
Date Tue, 14 Jul 2015 11:25:05 GMT

    [ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626207#comment-14626207 ]

Peng Zhang commented on YARN-3535:
----------------------------------

[~rohithsharma]

Thanks for rebasing and adding tests.

As for removing {{recoverResourceRequestForContainer}}: according to my notes, it caused the test {{CapacityScheduler#testRecoverRequestAfterPreemption}} to fail.
But I cannot recall the reasoning behind my old note:
bq. Remove call of recoverResourceRequestForContainer from preemption to avoid duplication
of recover RR.

I applied my patch {{YARN-3535-002.patch}} to our production cluster, and preemption works well
with FairScheduler.

I have seen the failure of {{TestAMRestart.testAMRestartWithExistingContainers}} before, and I
think it happens because:
bq. Changing TestAMRestart.java is because that case testAMRestartWithExistingContainers will
trigger this logic. After this patch, one more container may be scheduled, and attempt.getJustFinishedContainers().size()
may be bigger than expectedNum and loop never ends. So I simply change the situation.
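
To illustrate the quoted explanation, here is a minimal sketch of why a wait loop can fail to end when the observed count overshoots the expected value. It is not the actual test code: {{WaitLoopSketch}}, {{waitWithEquality}} and {{waitWithAtLeast}} are hypothetical names, the array of observed sizes stands in for successive values of attempt.getJustFinishedContainers().size(), and the assumption that the test compares for exact equality is mine. The polls are bounded here so the example terminates; returning false corresponds to the real loop spinning forever.

```java
public class WaitLoopSketch {

    /**
     * Hypothetical wait that succeeds only when the observed size is
     * exactly expectedNum. Returns false if no poll ever matched, which
     * in an unbounded loop would mean the loop never ends.
     */
    static boolean waitWithEquality(int[] observedSizes, int expectedNum) {
        for (int size : observedSizes) {
            if (size == expectedNum) {
                return true;
            }
        }
        return false;
    }

    /**
     * Hypothetical alternative that also accepts an overshoot. This is one
     * possible way to make the wait robust, not necessarily the change
     * made in the patch.
     */
    static boolean waitWithAtLeast(int[] observedSizes, int expectedNum) {
        for (int size : observedSizes) {
            if (size >= expectedNum) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int expectedNum = 2;
        // One extra container gets scheduled, so the size jumps from 1 to 3
        // and never equals expectedNum.
        int[] overshoot = {0, 1, 3, 3, 3};
        System.out.println("equality check terminates: "
                + waitWithEquality(overshoot, expectedNum));
        System.out.println("at-least check terminates: "
                + waitWithAtLeast(overshoot, expectedNum));
    }
}
```

With the overshooting sequence, the equality-based wait never matches, while the at-least variant returns on the first poll that reaches or passes expectedNum.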





>  ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
> ---------------------------------------------------------------------------------------------
>
>                 Key: YARN-3535
>                 URL: https://issues.apache.org/jira/browse/YARN-3535
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Peng Zhang
>            Assignee: Peng Zhang
>            Priority: Critical
>         Attachments: 0003-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log
>
>
> During a rolling update of the NM, the AM failed to start a container on the NM.
> The job then hung.
> AM logs are attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
