Date: Sat, 24 Sep 2016 18:22:20 +0000 (UTC)
From: "Zephyr Guo (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Reopened] (YARN-4743) ResourceManager crash because TimSort

[ https://issues.apache.org/jira/browse/YARN-4743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zephyr Guo reopened YARN-4743:
------------------------------

Sorry, for some reason this issue went unaddressed for a long time. We have now picked it up again and found the bug, so I am reopening the issue.

> ResourceManager crash because TimSort
> -------------------------------------
>
>                 Key: YARN-4743
>                 URL: https://issues.apache.org/jira/browse/YARN-4743
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.6.4
>            Reporter: Zephyr Guo
>            Assignee: Yufei Gu
>         Attachments: YARN-4743-cdh5.4.7.patch
>
>
> {code}
> 2016-02-26 14:08:50,821 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler
> java.lang.IllegalArgumentException: Comparison method violates its general contract!
>         at java.util.TimSort.mergeHi(TimSort.java:868)
>         at java.util.TimSort.mergeAt(TimSort.java:485)
>         at java.util.TimSort.mergeCollapse(TimSort.java:410)
>         at java.util.TimSort.sort(TimSort.java:214)
>         at java.util.TimSort.sort(TimSort.java:173)
>         at java.util.Arrays.sort(Arrays.java:659)
>         at java.util.Collections.sort(Collections.java:217)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:316)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:240)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1091)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:989)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1185)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:684)
>         at java.lang.Thread.run(Thread.java:745)
> 2016-02-26 14:08:50,822 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {code}
> Actually, this issue was found in 2.6.0-cdh5.4.7.
> I think the cause is that we modify {{Resource}} while we are sorting {{runnableApps}}.
> {code:title=FSLeafQueue.java}
>     Comparator comparator = policy.getComparator();
>     writeLock.lock();
>     try {
>       Collections.sort(runnableApps, comparator);
>     } finally {
>       writeLock.unlock();
>     }
>     readLock.lock();
> {code}
> {code:title=FairShareComparator}
>     public int compare(Schedulable s1, Schedulable s2) {
>     ......
>           s1.getResourceUsage(), minShare1);
>       boolean s2Needy = Resources.lessThan(RESOURCE_CALCULATOR, null,
>           s2.getResourceUsage(), minShare2);
>       minShareRatio1 = (double) s1.getResourceUsage().getMemory()
>           / Resources.max(RESOURCE_CALCULATOR, null, minShare1, ONE).getMemory();
>       minShareRatio2 = (double) s2.getResourceUsage().getMemory()
>           / Resources.max(RESOURCE_CALCULATOR, null, minShare2, ONE).getMemory();
>     ......
> {code}
> {{getResourceUsage}} returns the current Resource, and the current Resource is not stable: it can change while the sort is still running.
> {code:title=FSAppAttempt.java}
>   @Override
>   public Resource getResourceUsage() {
>     // Here the getPreemptedResources() always return zero, except in
>     // a preemption round
>     return Resources.subtract(getCurrentConsumption(), getPreemptedResources());
>   }
> {code}
> {code:title=SchedulerApplicationAttempt}
>   public Resource getCurrentConsumption() {
>     return currentConsumption;
>   }
>   // This method may modify current Resource.
>   public synchronized void recoverContainer(RMContainer rmContainer) {
>     ......
>     Resources.addTo(currentConsumption, rmContainer.getContainer()
>         .getResource());
>     ......
>   }
> {code}
> I suggest using a stable Resource in the comparator.
> Is there something wrong in my reasoning?
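
As a rough illustration of the "stable Resource" suggestion in the issue description, the sketch below copies each element's usage once, before sorting, and lets the comparator read only those copies. It is not the YARN patch: {{StableSortSketch}}, {{App}}, {{usedMemory}}, {{snapshotComparator}} and {{names}} are hypothetical stand-ins for illustration, and a plain {{long}} is assumed in place of a real {{Resource}}.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

public class StableSortSketch {

    /** Hypothetical stand-in for a Schedulable; not a YARN class. */
    public static final class App {
        public final String name;
        // In a real scheduler this value could be changed by other threads.
        public volatile long usedMemory;

        public App(String name, long usedMemory) {
            this.name = name;
            this.usedMemory = usedMemory;
        }
    }

    /**
     * Copies each app's usage once and returns a comparator that reads only
     * the copies, so later changes to usedMemory cannot affect the ordering.
     */
    public static Comparator<App> snapshotComparator(List<App> apps) {
        final Map<App, Long> snapshot = new IdentityHashMap<>();
        for (App app : apps) {
            snapshot.put(app, app.usedMemory);
        }
        return (a, b) -> Long.compare(snapshot.get(a), snapshot.get(b));
    }

    /** Returns the app names in list order. */
    public static List<String> names(List<App> apps) {
        List<String> result = new ArrayList<>();
        for (App app : apps) {
            result.add(app.name);
        }
        return result;
    }

    public static void main(String[] args) {
        List<App> apps = new ArrayList<>();
        App a = new App("a", 300);
        apps.add(a);
        apps.add(new App("b", 100));
        apps.add(new App("c", 200));

        Comparator<App> byUsage = snapshotComparator(apps);
        a.usedMemory = 0;      // simulate an update arriving after the snapshot
        apps.sort(byUsage);    // ordering still follows the snapshot values

        System.out.println(names(apps)); // prints [b, c, a]
    }
}
```

The comparator is only valid for the elements that were in the list when the snapshot was taken, so a new snapshot would be needed for every sort. Whether this fits the fair scheduler's locking and performance constraints is a separate question that the sketch does not address.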