hadoop-yarn-issues mailing list archives

From "Omkar Vinit Joshi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1422) RM CapacityScheduler can deadlock when getQueueUserAclInfo() is called and a container is completing
Date Tue, 19 Nov 2013 00:19:21 GMT

    [ https://issues.apache.org/jira/browse/YARN-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825988#comment-13825988 ]

Omkar Vinit Joshi commented on YARN-1422:
-----------------------------------------

Yes, this looks like a real problem.
See this [synchronization locking problem | https://issues.apache.org/jira/browse/YARN-897?focusedCommentId=13706284&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13706284].
Locks should always be acquired in order from the root queue down to the leaf queue. I think there may be other places where this ordering is mixed up.
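
To make the root-to-leaf rule concrete, here is a minimal sketch. The classes Parent and Leaf are hypothetical stand-ins, not the real ParentQueue and LeafQueue, and this is only one possible way to respect the ordering, not necessarily the change that should go into the scheduler. The read path takes the parent lock and then the leaf lock; the completion path releases the leaf lock before it touches the parent, so no thread ever holds a leaf lock while waiting for a parent lock.

```java
import java.util.ArrayList;
import java.util.List;

public class LockOrderSketch {

    // Hypothetical stand-in for a parent queue.
    static class Parent {
        private final List<Leaf> children = new ArrayList<>();
        private int completed;

        synchronized void addChild(Leaf leaf) {
            children.add(leaf);
        }

        // Root-to-leaf: holds the parent lock, then takes each leaf lock in turn.
        synchronized List<String> describeChildren() {
            List<String> out = new ArrayList<>();
            for (Leaf leaf : children) {
                out.add(leaf.describe());
            }
            return out;
        }

        synchronized void childCompleted() {
            completed++;
        }

        synchronized int completedCount() {
            return completed;
        }
    }

    // Hypothetical stand-in for a leaf queue.
    static class Leaf {
        private final String name;
        private final Parent parent;
        private int running;

        Leaf(String name, Parent parent, int running) {
            this.name = name;
            this.parent = parent;
            this.running = running;
        }

        synchronized String describe() {
            return name + ":" + running;
        }

        void completeContainer() {
            synchronized (this) {
                running--; // leaf-local bookkeeping under the leaf lock only
            }
            // The leaf lock is released here, so taking the parent lock
            // cannot form a leaf -> parent wait while the leaf is held.
            parent.childCompleted();
        }
    }

    // Runs a completer thread and a reader thread concurrently and
    // returns the number of completions the parent has recorded.
    static int run() {
        Parent root = new Parent();
        Leaf leaf = new Leaf("default", root, 1000);
        root.addChild(leaf);

        Thread completer = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                leaf.completeContainer();
            }
        });
        Thread reader = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                root.describeChildren();
            }
        });
        completer.start();
        reader.start();
        try {
            completer.join();
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return root.completedCount();
    }

    public static void main(String[] args) {
        System.out.println("completed = " + run());
    }
}
```

The trade-off in this sketch is that the leaf update and the parent update are no longer one atomic step, so a reader can briefly observe the leaf change before the parent change. Whether that is acceptable depends on the invariants the caller relies on.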

> RM CapacityScheduler can deadlock when getQueueUserAclInfo() is called and a container is completing
> ----------------------------------------------------------------------------------------------------
>
>                 Key: YARN-1422
>                 URL: https://issues.apache.org/jira/browse/YARN-1422
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler, resourcemanager
>    Affects Versions: 2.2.0
>            Reporter: Adam Kawa
>            Priority: Critical
>
> If getQueueUserAclInfo() is called on a parent/root queue (e.g. via CapacityScheduler.getQueueUserAclInfo) while a container is completing, the ResourceManager can deadlock.
> It is similar to https://issues.apache.org/jira/browse/YARN-325. 
> *More details:*
> * Thread A
> 1) Inside a synchronized block (holding lock 0x00000000c18d8870, a LeafQueue instance), LeafQueue.completedContainer wants to inform the parent queue that a container is completing and invokes the ParentQueue.completedContainer method.
> 3) ParentQueue.completedContainer waits to acquire the lock on the parent queue (lock 0x00000000c1846350, a ParentQueue instance) before entering its synchronized block. It cannot acquire this lock because Thread B already holds it.
> * Thread B
> 0) A moment earlier, CapacityScheduler.getQueueUserAclInfo is called. It invokes a synchronized method of ParentQueue, ParentQueue.getQueueUserAclInfo, and so acquires the lock (0x00000000c1846350, a ParentQueue instance) that Thread A will be waiting for.
> 2) Unluckily, ParentQueue.getQueueUserAclInfo iterates over the ACLs of its child queues and needs to run a synchronized method, LeafQueue.getQueueUserAclInfo, but it does not hold the lock on that LeafQueue (0x00000000c18d8870, a LeafQueue instance). That lock is already held by LeafQueue.completedContainer in Thread A.
> The order that causes the deadlock: B0 -> A1 -> B2 -> A3.
> *Java Stacktrace*
> {code}
> Found one Java-level deadlock:
> =============================
> "1956747953@qtp-109760451-1959":
>   waiting to lock monitor 0x00000000434e10c8 (object 0x00000000c1846350, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue),
>   which is held by "IPC Server handler 39 on 8032"
> "IPC Server handler 39 on 8032":
>   waiting to lock monitor 0x00000000422bbc58 (object 0x00000000c18d8870, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue),
>   which is held by "ResourceManager Event Processor"
> "ResourceManager Event Processor":
>   waiting to lock monitor 0x00000000434e10c8 (object 0x00000000c1846350, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue),
>   which is held by "IPC Server handler 39 on 8032"
> Java stack information for the threads listed above:
> ===================================================
> "1956747953@qtp-109760451-1959":
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.getUsedCapacity(ParentQueue.java:276)
> 	- waiting to lock <0x00000000c1846350> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
> 	at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.CapacitySchedulerInfo.<init>(CapacitySchedulerInfo.java:49)
> 	at org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:203)
> 	at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66)
> 	at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76)
> 	at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
> 	at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
> 	at org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117)
> 	at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845)
> 	at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56)
> 	at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
> 	at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212)
> 	at org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.scheduler(RmController.java:76)
> 	at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153)
> 	at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
> 	at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
> 	at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
> 	at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
> 	at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
> 	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
> 	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
> 	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
> 	at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
> 	at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
> 	at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
> 	at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
> 	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> 	at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
> 	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> 	at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1081)
> 	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> 	at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> 	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> 	at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> 	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> 	at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> 	at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> 	at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> 	at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> 	at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> 	at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> 	at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> 	at org.mortbay.jetty.Server.handle(Server.java:326)
> 	at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
> 	at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
> 	at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
> 	at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
> 	at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> 	at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
> 	at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> "IPC Server handler 39 on 8032":
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.getQueueUserAclInfo(LeafQueue.java:544)
> 	- waiting to lock <0x00000000c18d8870> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.getQueueUserAclInfo(ParentQueue.java:351)
> 	- locked <0x00000000c1846350> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.getQueueUserAclInfo(CapacityScheduler.java:622)
> 	at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:517)
> 	at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getQueueUserAcls(ApplicationClientProtocolPBServiceImpl.java:225)
> 	at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:255)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2053)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2047)
> "ResourceManager Event Processor":
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.completedContainer(ParentQueue.java:693)
> 	- waiting to lock <0x00000000c1846350> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1460)
> 	- locked <0x00000000c18d8870> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:838)
> 	- locked <0x00000000c1846310> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:648)
> 	- locked <0x00000000c1846310> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:734)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:86)
> 	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
> 	at java.lang.Thread.run(Thread.java:662)
> Found 1 deadlock.
> {code}
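
The "Found one Java-level deadlock" report quoted above is the kind of output a HotSpot thread dump prints. The JVM also exposes monitor-deadlock detection programmatically through java.lang.management.ThreadMXBean, which could be useful for a watchdog or a test. The following is a minimal sketch under that assumption; the class name DeadlockProbe and its method are hypothetical and not part of Hadoop.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

public class DeadlockProbe {

    // Returns one line per thread that the JVM reports as deadlocked on an
    // object monitor; the list is empty when no such deadlock exists.
    static List<String> describeMonitorDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findMonitorDeadlockedThreads(); // null when nothing is deadlocked
        List<String> report = new ArrayList<>();
        if (ids == null) {
            return report;
        }
        for (ThreadInfo info : mx.getThreadInfo(ids)) {
            if (info == null) {
                continue; // the thread ended between the two calls
            }
            report.add("\"" + info.getThreadName() + "\" waiting to lock "
                    + info.getLockName() + ", held by \""
                    + info.getLockOwnerName() + "\"");
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> report = describeMonitorDeadlocks();
        if (report.isEmpty()) {
            System.out.println("No monitor deadlock found.");
        } else {
            report.forEach(System.out::println);
        }
    }
}
```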



--
This message was sent by Atlassian JIRA
(v6.1#6144)
