hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Wilfred Spiegelenburg (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2910) FSLeafQueue can throw ConcurrentModificationException
Date Mon, 08 Dec 2014 16:10:17 GMT

    [ https://issues.apache.org/jira/browse/YARN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238007#comment-14238007
] 

Wilfred Spiegelenburg commented on YARN-2910:
---------------------------------------------

The fix causes the {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler}}
to fail.
There is a deadlock that is created by the synchronised read access in the leaf queue for
the {{runnableApps}}. If an app has two containers at different stages in the allocation it
can happen that the {{appAttempt}} is locked by one and the {{runnableApps}} by the second
causing the hang.

This is what I was afraid of when I mentioned the slow down, I did not anticipate it this
bad but the number of reads far outnumber the writes.
The earlier proposed CopyOnWriteArrayList will also not work due to the sort that is called
(and I overlooked) which is not supported.

> FSLeafQueue can throw ConcurrentModificationException
> -----------------------------------------------------
>
>                 Key: YARN-2910
>                 URL: https://issues.apache.org/jira/browse/YARN-2910
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.5.0, 2.6.0, 2.5.1, 2.5.2
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Ray Chiang
>         Attachments: FSLeafQueue_concurrent_exception.txt, YARN-2910.004.patch, YARN-2910.1.patch,
YARN-2910.2.patch, YARN-2910.3.patch, YARN-2910.4.patch, YARN-2910.patch
>
>
> The list that maintains the runnable and the non runnable apps are a standard ArrayList
but there is no guarantee that it will only be manipulated by one thread in the system. This
can lead to the following exception:
> {noformat}
> 2014-11-12 02:29:01,169 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
ERROR IN CONTACTING RM.
> java.util.ConcurrentModificationException: java.util.ConcurrentModificationException
> at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:859)
> at java.util.ArrayList$Itr.next(ArrayList.java:831)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.getResourceUsage(FSLeafQueue.java:147)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getHeadroom(FSAppAttempt.java:180)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:923)
> at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:516)
> {noformat}
> Full stack trace in the attached file.
> We should guard against that by using a thread safe version from java.util.concurrent.CopyOnWriteArrayList



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message