hadoop-yarn-issues mailing list archives

From "Zephyr Guo (JIRA)" <j...@apache.org>
Subject [jira] [Created] (YARN-4743) ResourceManager crash because TimSort
Date Sat, 27 Feb 2016 09:13:18 GMT
Zephyr Guo created YARN-4743:
--------------------------------

             Summary: ResourceManager crash because TimSort
                 Key: YARN-4743
                 URL: https://issues.apache.org/jira/browse/YARN-4743
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 2.6.4
            Reporter: Zephyr Guo


{code}
2016-02-26 14:08:50,821 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
Error in handling event type NODE_UPDATE to the scheduler
java.lang.IllegalArgumentException: Comparison method violates its general contract!
         at java.util.TimSort.mergeHi(TimSort.java:868)
         at java.util.TimSort.mergeAt(TimSort.java:485)
         at java.util.TimSort.mergeCollapse(TimSort.java:410)
         at java.util.TimSort.sort(TimSort.java:214)
         at java.util.TimSort.sort(TimSort.java:173)
         at java.util.Arrays.sort(Arrays.java:659)
         at java.util.Collections.sort(Collections.java:217)
         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:316)
         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:240)
         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1091)
         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:989)
         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1185)
         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:684)
         at java.lang.Thread.run(Thread.java:745)
2016-02-26 14:08:50,822 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
Exiting, bbye..
{code}

Actually, this issue was found in 2.6.0-cdh5.4.7.
I think the cause is that a {{Resource}} read by the comparator is modified while {{runnableApps}} is being sorted.
{code:title=FSLeafQueue.java}
    Comparator<Schedulable> comparator = policy.getComparator();
    writeLock.lock();
    try {
      Collections.sort(runnableApps, comparator);
    } finally {
      writeLock.unlock();
    }
    readLock.lock();
{code}

{code:title=FairShareComparator}
public int compare(Schedulable s1, Schedulable s2) {
......
          s1.getResourceUsage(), minShare1);
      boolean s2Needy = Resources.lessThan(RESOURCE_CALCULATOR, null,
          s2.getResourceUsage(), minShare2);
      minShareRatio1 = (double) s1.getResourceUsage().getMemory()
          / Resources.max(RESOURCE_CALCULATOR, null, minShare1, ONE).getMemory();
      minShareRatio2 = (double) s2.getResourceUsage().getMemory()
          / Resources.max(RESOURCE_CALCULATOR, null, minShare2, ONE).getMemory();
......
{code}
{{getResourceUsage}} returns a value derived from the current {{Resource}}, which is not stable: it can change while the sort is in progress.
{code:title=FSAppAttempt.java}
@Override
  public Resource getResourceUsage() {
    // Here the getPreemptedResources() always return zero, except in
    // a preemption round
    return Resources.subtract(getCurrentConsumption(), getPreemptedResources());
  }
{code}
{code:title=SchedulerApplicationAttempt}
 public Resource getCurrentConsumption() {
    return currentConsumption;
  }

// This method may modify current Resource.
public synchronized void recoverContainer(RMContainer rmContainer) {
......
    Resources.addTo(currentConsumption, rmContainer.getContainer()
      .getResource());
......
  }
{code}
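To illustrate why an unstable value matters to the sort, here is a minimal sketch that is not taken from Hadoop. The {{App}} class, its {{usedMemory}} field, and the method {{antisymmetricAcrossMutation}} are hypothetical stand-ins for a {{Schedulable}} and its resource usage. The sketch changes the value between two comparisons on a single thread (simulating a concurrent update) and checks whether {{compare(a, b)}} and {{compare(b, a)}} still have opposite signs, as the {{Comparator}} contract requires.

```java
import java.util.Comparator;

class UnstableKeyDemo {
    /** Hypothetical stand-in for a schedulable whose usage can change at any time. */
    static final class App {
        long usedMemory;
        App(long usedMemory) { this.usedMemory = usedMemory; }
    }

    /** Comparator that reads the live (mutable) value on every call. */
    static final Comparator<App> BY_LIVE_USAGE =
        (x, y) -> Long.compare(x.usedMemory, y.usedMemory);

    /**
     * Compares (a, b), then changes a's usage to simulate an update arriving
     * in the middle of a sort, then compares (b, a).
     * Returns true only if the two results have opposite signs.
     */
    static boolean antisymmetricAcrossMutation(App a, App b, long newUsageForA) {
        int first = BY_LIVE_USAGE.compare(a, b);
        a.usedMemory = newUsageForA; // simulated mid-sort modification
        int second = BY_LIVE_USAGE.compare(b, a);
        return Integer.signum(first) == -Integer.signum(second);
    }

    public static void main(String[] args) {
        App a = new App(100);
        App b = new App(200);
        // a starts below b, then jumps above it between the two comparisons.
        System.out.println(antisymmetricAcrossMutation(a, b, 300));
    }
}
```

When the value changes between comparisons, the two results can disagree, and that kind of inconsistency is what TimSort reports as "Comparison method violates its general contract!". Whether a real sort actually throws depends on timing and on the data, so the sketch checks the contract directly instead of relying on the exception.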
I suggest using a stable {{Resource}} in the comparator.
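One possible reading of "stable" is to copy each value once before sorting and compare only the copies. The sketch below assumes that reading and is not a patch for {{FSLeafQueue}}. {{App}}, {{Snapshot}}, and {{sortByStableUsage}} are hypothetical names, and a single {{long}} stands in for the {{Resource}} object.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SnapshotSort {
    /** Hypothetical stand-in for a schedulable whose usage may change concurrently. */
    static final class App {
        final String name;
        volatile long usedMemory;
        App(String name, long usedMemory) {
            this.name = name;
            this.usedMemory = usedMemory;
        }
    }

    /** Immutable pair of an app and the usage observed when the snapshot was taken. */
    static final class Snapshot {
        final App app;
        final long usedMemory;
        Snapshot(App app, long usedMemory) {
            this.app = app;
            this.usedMemory = usedMemory;
        }
    }

    /** Returns a new list ordered by the usage each app had at snapshot time. */
    static List<App> sortByStableUsage(List<App> apps) {
        // Read every mutable value exactly once, before any comparison happens.
        List<Snapshot> snapshots = new ArrayList<>(apps.size());
        for (App app : apps) {
            snapshots.add(new Snapshot(app, app.usedMemory));
        }
        // The comparator sees only immutable snapshot values.
        snapshots.sort(Comparator.comparingLong((Snapshot s) -> s.usedMemory));
        List<App> sorted = new ArrayList<>(apps.size());
        for (Snapshot s : snapshots) {
            sorted.add(s.app);
        }
        return sorted;
    }

    public static void main(String[] args) {
        List<App> apps = new ArrayList<>();
        apps.add(new App("a", 300));
        apps.add(new App("b", 100));
        apps.add(new App("c", 200));
        for (App app : sortByStableUsage(apps)) {
            System.out.println(app.name);
        }
    }
}
```

The resulting order may already be slightly out of date by the time the sort finishes, but the comparator can no longer give contradictory answers during the sort. The trade-off is one extra allocation per element on each call.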

Is there anything wrong with my reasoning?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
