hadoop-yarn-issues mailing list archives

From "Yufei Gu (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (YARN-4743) ResourceManager crash because TimSort
Date Wed, 28 Sep 2016 09:07:21 GMT

    [ https://issues.apache.org/jira/browse/YARN-4743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15528957#comment-15528957 ]

Yufei Gu edited comment on YARN-4743 at 9/28/16 9:06 AM:
---------------------------------------------------------

Hi [~gzh1992n], thanks for working on this. Some thoughts about the patch: both 0.5 (a weight
less than 1.0) and 0.0 are valid weights in the fair scheduler. One use case for a
zero-weight queue is to run jobs only when the non-zero-weight queues have no jobs
to run. So it makes no sense to me to enforce a weight larger than
1.0.
If NaN breaks the transitivity of the comparator, we can avoid NaN in other ways. For example, if the first weight
is 0.0 and the second is greater than 0.0, then obviously the second one is needier than the first
one.
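
To illustrate the NaN point, here is a minimal sketch, not the actual FairShareComparator. The names {{Entry}}, {{NAIVE}} and {{GUARDED}} are hypothetical, and it assumes weights are never negative. It shows how a usage/weight ratio of 0.0/0.0 makes a naive comparator non-transitive, and how checking for a zero weight before dividing avoids computing NaN at all.

```java
import java.util.Comparator;

class NanComparatorSketch {
    // Hypothetical stand-in for a schedulable: only usage and weight.
    static final class Entry {
        final double usage;
        final double weight;

        Entry(double usage, double weight) {
            this.usage = usage;
            this.weight = weight;
        }

        // 0.0 / 0.0 evaluates to NaN.
        double ratio() {
            return usage / weight;
        }
    }

    // Naive comparator: every < or > test against NaN is false,
    // so a NaN ratio "ties" with every other ratio.
    static final Comparator<Entry> NAIVE = (a, b) -> {
        double r1 = a.ratio();
        double r2 = b.ratio();
        if (r1 < r2) return -1;
        if (r1 > r2) return 1;
        return 0;
    };

    // Guarded comparator (assumption: weights are >= 0): zero-weight entries
    // sort after all positive-weight entries, and the ratio is only computed
    // when both weights are positive.
    static final Comparator<Entry> GUARDED = (a, b) -> {
        boolean zero1 = a.weight == 0.0;
        boolean zero2 = b.weight == 0.0;
        if (zero1 && zero2) return Double.compare(a.usage, b.usage);
        if (zero1) return 1;   // b has a positive weight, so b is needier
        if (zero2) return -1;  // a has a positive weight, so a is needier
        return Double.compare(a.ratio(), b.ratio());
    };

    public static void main(String[] args) {
        Entry a = new Entry(1.0, 1.0); // ratio 1.0
        Entry b = new Entry(0.0, 0.0); // ratio NaN
        Entry c = new Entry(2.0, 1.0); // ratio 2.0

        // NAIVE reports a == b and b == c but a < c, which is not transitive.
        System.out.println("naive:   " + NAIVE.compare(a, b) + " "
            + NAIVE.compare(b, c) + " " + NAIVE.compare(a, c));

        // GUARDED orders the same entries consistently: a < c < b.
        System.out.println("guarded: " + GUARDED.compare(a, b) + " "
            + GUARDED.compare(b, c) + " " + GUARDED.compare(a, c));
    }
}
```

With the guard, a zero-weight entry is simply ordered last, which matches the rule that the entry with the positive weight is the needier one.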


was (Author: yufeigu):
Hi [~gzh1992n], thanks for working on this. Some thoughts about the patch: both 0.5 (a weight
less than 1.0) and 0.0 are valid weights in the fair scheduler. One use case for a
zero-weight queue is to run jobs only when the non-zero-weight queues have no jobs
to run. So it makes no sense to me to enforce a weight larger than
1.0.

> ResourceManager crash because TimSort
> -------------------------------------
>
>                 Key: YARN-4743
>                 URL: https://issues.apache.org/jira/browse/YARN-4743
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.6.4
>            Reporter: Zephyr Guo
>             Fix For: 3.0.0-alpha1
>
>         Attachments: YARN-4743-v1.patch, YARN-CDH5.4.7.patch, timsort.log
>
>
> {code}
> 2016-02-26 14:08:50,821 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler
> java.lang.IllegalArgumentException: Comparison method violates its general contract!
>          at java.util.TimSort.mergeHi(TimSort.java:868)
>          at java.util.TimSort.mergeAt(TimSort.java:485)
>          at java.util.TimSort.mergeCollapse(TimSort.java:410)
>          at java.util.TimSort.sort(TimSort.java:214)
>          at java.util.TimSort.sort(TimSort.java:173)
>          at java.util.Arrays.sort(Arrays.java:659)
>          at java.util.Collections.sort(Collections.java:217)
>          at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:316)
>          at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:240)
>          at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1091)
>          at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:989)
>          at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1185)
>          at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
>          at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:684)
>          at java.lang.Thread.run(Thread.java:745)
> 2016-02-26 14:08:50,822 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {code}
> Actually, this issue was found in 2.6.0-cdh5.4.7.
> I think the cause is that we modify {{Resource}} while we are sorting {{runnableApps}}.
> {code:title=FSLeafQueue.java}
>     Comparator<Schedulable> comparator = policy.getComparator();
>     writeLock.lock();
>     try {
>       Collections.sort(runnableApps, comparator);
>     } finally {
>       writeLock.unlock();
>     }
>     readLock.lock();
> {code}
> {code:title=FairShareComparator}
> public int compare(Schedulable s1, Schedulable s2) {
> ......
>           s1.getResourceUsage(), minShare1);
>       boolean s2Needy = Resources.lessThan(RESOURCE_CALCULATOR, null,
>           s2.getResourceUsage(), minShare2);
>       minShareRatio1 = (double) s1.getResourceUsage().getMemory()
>           / Resources.max(RESOURCE_CALCULATOR, null, minShare1, ONE).getMemory();
>       minShareRatio2 = (double) s2.getResourceUsage().getMemory()
>           / Resources.max(RESOURCE_CALCULATOR, null, minShare2, ONE).getMemory();
> ......
> {code}
> {{getResourceUsage}} returns the current Resource, which is not stable: it can change while the sort is in progress.
> {code:title=FSAppAttempt.java}
> @Override
>   public Resource getResourceUsage() {
>     // Here the getPreemptedResources() always return zero, except in
>     // a preemption round
>     return Resources.subtract(getCurrentConsumption(), getPreemptedResources());
>   }
> {code}
> {code:title=SchedulerApplicationAttempt}
>  public Resource getCurrentConsumption() {
>     return currentConsumption;
>   }
> // This method may modify current Resource.
> public synchronized void recoverContainer(RMContainer rmContainer) {
> ......
>     Resources.addTo(currentConsumption, rmContainer.getContainer()
>       .getResource());
> ......
>   }
> {code}
> I suggest using a stable Resource in the comparator.
> Am I misunderstanding something?
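
The suggestion to use a stable Resource in the comparator could be sketched roughly as follows. This is an illustration under stated assumptions, not the attached patch: {{App}}, {{Snapshot}} and {{sortByUsageSnapshot}} are hypothetical names, and usage is reduced to a single memory figure. Each app's usage is read exactly once before sorting, so the comparator only ever sees values that cannot change mid-sort.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

class SnapshotSortSketch {
    // Hypothetical stand-in for an app attempt whose usage may be
    // updated by another thread at any time.
    static final class App {
        final String name;
        final AtomicLong usageMb = new AtomicLong();

        App(String name, long usageMb) {
            this.name = name;
            this.usageMb.set(usageMb);
        }
    }

    // Immutable pair: an app plus the usage observed once, before sorting.
    static final class Snapshot {
        final App app;
        final long usageMb;

        Snapshot(App app) {
            this.app = app;
            this.usageMb = app.usageMb.get();
        }
    }

    // Returns a new list ordered by the snapshotted usage; the input list
    // is not modified, and live usage is never read during the sort.
    static List<App> sortByUsageSnapshot(List<App> apps) {
        List<Snapshot> snapshots = new ArrayList<>(apps.size());
        for (App app : apps) {
            snapshots.add(new Snapshot(app));
        }
        snapshots.sort(Comparator.comparingLong((Snapshot s) -> s.usageMb));

        List<App> sorted = new ArrayList<>(snapshots.size());
        for (Snapshot s : snapshots) {
            sorted.add(s.app);
        }
        return sorted;
    }

    public static void main(String[] args) {
        List<App> apps = Arrays.asList(
            new App("a", 300), new App("b", 100), new App("c", 200));
        for (App app : sortByUsageSnapshot(apps)) {
            System.out.println(app.name + " " + app.usageMb.get());
        }
    }
}
```

The trade-off is one extra allocation per app on every sort. In exchange, the ordering stays self-consistent even if usage is updated concurrently, although the result may be slightly stale by the time it is used.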



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

