hadoop-mapreduce-issues mailing list archives

From "Siddharth Seth (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3355) AM scheduling hangs frequently with sort job on 350 nodes
Date Thu, 10 Nov 2011 03:33:51 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147471#comment-13147471 ]

Siddharth Seth commented on MAPREDUCE-3355:
-------------------------------------------

There's another extremely unlikely situation which could cause this.
Canceling the timer doesn't affect the timer task if it has already started running. An interrupt
could therefore arrive at any time after the cancel, and could interrupt delivery of the
TA_CONTAINER_CLEANED event or the ContainerLaunchedEvent. This would take a combination of
startContainer finishing around when the timer expires plus some very specific thread scheduling.
The same applies if start/stopContainer completes at around the same time as the timer kicks in.
A possible fix would be to synchronize on the CommandTimer in the main task whenever we don't care
about interrupts, and to always have the CommandTimer synchronize on itself.
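
A rough, self-contained sketch of that idea follows. It is not the ContainerLauncherImpl code:
the class CommandTimerSketch, the disarm() method, the "disarmed" flag and the event names
"TIMED_OUT" and "LAUNCHED" are all made up for illustration, and Thread.sleep stands in for the
startContainer/stopContainer call. The point is only that run() and disarm() share the
CommandTimer's own lock, so once disarm() has returned the timer can no longer interrupt the
thread while it queues its event.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class CommandTimerSketch {

    /** Hypothetical stand-in for the CommandTimer discussed above. */
    static final class CommandTimer extends TimerTask {
        private final Thread commandThread;
        private boolean disarmed = false; // guarded by "this"

        CommandTimer(Thread commandThread) {
            this.commandThread = commandThread;
        }

        // The timer always synchronizes on itself before interrupting.
        @Override
        public synchronized void run() {
            if (!disarmed) {
                commandThread.interrupt();
            }
        }

        // Called by the command thread once it no longer cares about interrupts.
        // TimerTask.cancel() alone does not stop a run() that has already started,
        // so the flag is what actually closes the window.
        synchronized void disarm() {
            disarmed = true;
            cancel();
        }
    }

    /** Runs a simulated command under a timeout and queues the resulting event. */
    static String runCommand(long timeoutMs, long workMs, BlockingQueue<String> events) {
        Timer timer = new Timer(true); // daemon timer thread
        CommandTimer commandTimer = new CommandTimer(Thread.currentThread());
        timer.schedule(commandTimer, timeoutMs);

        boolean timedOut = false;
        try {
            Thread.sleep(workMs); // placeholder for the start/stopContainer call
        } catch (InterruptedException e) {
            timedOut = true;
        } finally {
            commandTimer.disarm(); // after this returns, no further interrupt can come
            timer.cancel();
        }
        // Clear an interrupt that landed between the end of the work and disarm().
        Thread.interrupted();

        String event = timedOut ? "TIMED_OUT" : "LAUNCHED";
        try {
            events.put(event); // the interruptible call seen in the stack traces
        } catch (InterruptedException e) {
            throw new IllegalStateException("stray interrupt reached event delivery", e);
        }
        return event;
    }

    public static void main(String[] args) {
        BlockingQueue<String> events = new LinkedBlockingQueue<>();
        System.out.println(runCommand(50, 2000, events));  // timer fires during the work
        System.out.println(runCommand(5000, 10, events));  // work finishes well before the timer
    }
}
```

Holding the lock only inside disarm(), rather than across the whole event delivery, keeps the
timer thread from blocking for long; whether that trade-off suits the real launcher is an open
question.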

                
> AM scheduling hangs frequently with sort job on 350 nodes
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3355
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3355
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3355-20111109.1.txt, MAPREDUCE-3355-20111109.txt
>
>
> Another collaboration with [~karams]. Sort job hangs not so rarely on a 350 node cluster.
> Found this in AM logs:
> {code}
> Exception in thread "ContainerLauncher #60" org.apache.hadoop.yarn.YarnException: java.lang.InterruptedException
>             at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:170)
>             at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:379)
>             at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>             at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>             at java.lang.Thread.run(Thread.java:619)
> Caused by: java.lang.InterruptedException
>             at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1199)
>             at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:312)
>             at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:294)
>             at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:168)
>             ... 4 more
> Exception in thread "ContainerLauncher #53" org.apache.hadoop.yarn.YarnException: java.lang.InterruptedException
>             at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:170)
>             at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.sendContainerLaunchFailedMsg(ContainerLauncherImpl.java:405)
>             at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:330)
>             at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>             at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>             at java.lang.Thread.run(Thread.java:619)
> Caused by: java.lang.InterruptedException
>             at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1199)
>             at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:312)
>             at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:294)
>             at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:168)
>             ... 5 more
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
