hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Vinod Kumar Vavilapalli (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-2954) Deadlock in NM with threads racing for ApplicationAttemptId
Date Thu, 08 Sep 2011 13:42:08 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100312#comment-13100312
] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-2954:
----------------------------------------------------

Because of this, the (only) MR AM had one of its map stuck in SUCCESS_CONTAINER_CLEANUP state.
On the NM, the _stopContainer()_ request from this AM was stuck on ApplicationAttemptId too.
{code}
"IPC Server handler 3 on 45450" daemon prio=10 tid=0xafd29000 nid=0x68a3 waiting for monitor
entry [0xaf00b000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl.getApplicationId(ApplicationAttemptIdPBImpl.java:101)
        - waiting to lock <0xb6a43ba0> (a org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl)
        at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl.compareTo(ApplicationAttemptIdPBImpl.java:144)
        - locked <0xb6bceac8> (a org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl)
        at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl.compareTo(ApplicationAttemptIdPBImpl.java:31)
        at org.apache.hadoop.yarn.api.records.impl.pb.ContainerIdPBImpl.compareTo(ContainerIdPBImpl.java:215)
        at org.apache.hadoop.yarn.api.records.impl.pb.ContainerIdPBImpl.compareTo(ContainerIdPBImpl.java:34)
        at java.util.concurrent.ConcurrentSkipListMap.doGet(ConcurrentSkipListMap.java:797)
        at java.util.concurrent.ConcurrentSkipListMap.get(ConcurrentSkipListMap.java:1640)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.stopContainer(ContainerManagerImpl.java:311)
        at org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagerPBServiceImpl.stopContainer(ContainerManagerPBServiceImpl.java:80)
        at org.apache.hadoop.yarn.proto.ContainerManager$ContainerManagerService$2.callBlockingMethod(ContainerManager.java:85)
        at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Server.call(ProtoOverHadoopRpcEngine.java:337)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1496)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1492)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1135)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1490)
{code}

So because of this, the map got stuck, all the reducers were spinning for TaskCompletionEvents
and the world came to a halt.

> Deadlock in NM with threads racing for ApplicationAttemptId
> -----------------------------------------------------------
>
>                 Key: MAPREDUCE-2954
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2954
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>             Fix For: 0.23.0
>
>
> Found this:
> {code}
> Java stack information for the threads listed above:
> ===================================================
> "Thread-45":
>         at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl.getApplicationId(ApplicationAttemptIdPBImpl.java:101)
>         - waiting to lock <0xb6a43ba0> (a org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl)
>         at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl.compareTo(ApplicationAttemptIdPBImpl.java:144)
>         - locked <0xb6a443a0> (a org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl)
>         at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl.compareTo(ApplicationAttemptIdPBImpl.java:31)
>         at org.apache.hadoop.yarn.api.records.impl.pb.ContainerIdPBImpl.compareTo(ContainerIdPBImpl.java:215)
>         at org.apache.hadoop.yarn.api.records.impl.pb.ContainerIdPBImpl.compareTo(ContainerIdPBImpl.java:34)
>         at java.util.concurrent.ConcurrentSkipListMap.doGet(ConcurrentSkipListMap.java:797)
>         at java.util.concurrent.ConcurrentSkipListMap.get(ConcurrentSkipListMap.java:1640)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:360)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:355)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:113)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> "Thread-30":
>         at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl.getApplicationId(ApplicationAttemptIdPBImpl.java:101)
>         - waiting to lock <0xb6a443a0> (a org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl)
>         at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl.compareTo(ApplicationAttemptIdPBImpl.java:144)
>         - locked <0xb6a43ba0> (a org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl)
>         at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl.compareTo(ApplicationAttemptIdPBImpl.java:31)
>         at org.apache.hadoop.yarn.api.records.impl.pb.ContainerIdPBImpl.compareTo(ContainerIdPBImpl.java:215)
>         at org.apache.hadoop.yarn.api.records.impl.pb.ContainerIdPBImpl.compareTo(ContainerIdPBImpl.java:34)
>         at java.util.concurrent.ConcurrentSkipListMap.doRemove(ConcurrentSkipListMap.java:1078)
>         at java.util.concurrent.ConcurrentSkipListMap.remove(ConcurrentSkipListMap.java:1673)
>         at java.util.concurrent.ConcurrentSkipListMap$Iter.remove(ConcurrentSkipListMap.java:2256)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.getNodeStatus(NodeStatusUpdaterImpl.java:223)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.access$300(NodeStatusUpdaterImpl.java:62)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:262)
> Found 1 deadlock.
> {code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message