hadoop-mapreduce-issues mailing list archives

From "Vinod Kumar Vavilapalli (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3031) Job Client goes into infinite loop when we kill AM
Date Mon, 19 Sep 2011 13:31:09 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107813#comment-13107813 ]

Vinod Kumar Vavilapalli commented on MAPREDUCE-3031:
----------------------------------------------------

This is a bug in the NM: just about any container that is killed this way (by running kill $pid
on the node) will be stuck in the RUNNING state on the RM. I found this on the corresponding NM:

{code}
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: CONTAINER_KILLED_ON_REQUEST at RUNNING
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:297)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:39)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:439)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:685)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:69)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:356)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:349)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:113)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
        at java.lang.Thread.run(Thread.java:619)
{code}

This is because an exit code of 137 or 143 is treated as a kill request. In hindsight this turns
out to be a bad idea; we should fix it.
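
To illustrate why the exit code alone is ambiguous: by shell convention a process killed by signal N exits with status 128 + N, so 137 corresponds to SIGKILL (9) and 143 to SIGTERM (15), whether the signal came from the NM or from someone running kill on the node. The following is only a minimal sketch of that arithmetic and of one possible way to tell the two cases apart. The class ExitCodeSketch, its methods, and the killRequested flag are hypothetical and are not Hadoop APIs; only the event name CONTAINER_KILLED_ON_REQUEST comes from the trace above, and the other labels are illustrative.

```java
import java.util.OptionalInt;

/** Hypothetical helper (not part of Hadoop) that decodes shell-style exit codes. */
final class ExitCodeSketch {
    // Shell convention: a process terminated by signal N reports exit status 128 + N.
    static final int SIGNAL_BASE = 128;

    /** Returns the signal number implied by an exit code, or empty if none is implied. */
    static OptionalInt signalOf(int exitCode) {
        if (exitCode > SIGNAL_BASE && exitCode <= 255) {
            return OptionalInt.of(exitCode - SIGNAL_BASE);
        }
        return OptionalInt.empty();
    }

    /**
     * Assumed policy for illustration: report a kill-on-request only when a kill
     * was actually requested, instead of inferring it from 137/143 alone.
     */
    static String classify(int exitCode, boolean killRequested) {
        if (exitCode == 0) {
            return "EXITED_WITH_SUCCESS"; // illustrative label
        }
        if (killRequested && signalOf(exitCode).isPresent()) {
            return "CONTAINER_KILLED_ON_REQUEST"; // event name seen in the trace
        }
        return "EXITED_WITH_FAILURE"; // illustrative label
    }

    public static void main(String[] args) {
        // 137 = 128 + 9 (SIGKILL), 143 = 128 + 15 (SIGTERM)
        System.out.println(signalOf(137).getAsInt());
        System.out.println(signalOf(143).getAsInt());
        // An external "kill -9" with no kill request is treated as a failure here.
        System.out.println(classify(137, false));
        System.out.println(classify(137, true));
    }
}
```

Under this assumed policy, a container killed from outside is reported as an ordinary failure, so it would not raise a kill-on-request event while the container is still in RUNNING.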

> Job Client goes into infinite loop when we kill AM
> --------------------------------------------------
>
>                 Key: MAPREDUCE-3031
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3031
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Karam Singh
>             Fix For: 0.23.0
>
>
> Started a cluster. Submitted a sleep job with around 10000 maps and 1000 reduces.
> Killed the AM with kill -9, by which time 7000 maps had already completed.
> On the RM web UI, the application is stuck in the Application.RUNNING state, and the JobClient goes
> into an infinite loop because the RM keeps telling the client that the application is running.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
