hadoop-yarn-issues mailing list archives

From "Nathan Roberts (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-212) NM state machine ignores an APPLICATION_CONTAINER_FINISHED event when it shouldn't
Date Mon, 12 Nov 2012 20:55:13 GMT

    [ https://issues.apache.org/jira/browse/YARN-212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13495622#comment-13495622 ]

Nathan Roberts commented on YARN-212:
-------------------------------------

The interesting parts of the logs are:

2012-11-07 05:36:33,754 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
Application application_1351873505780_75229 transitioned from NEW to INITING
2012-11-07 05:36:33,754 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
Adding container_1351873505780_75229_01_000004 to application application_1351873505780_75229
2012-11-07 05:36:33,760 [Node Status Updater] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl:
Sending out status for container: container_id {, app_attempt_id {, application_id {, id:
75229, cluster_timestamp: 1351873505780, }, attemptId: 1, }, id: 4, }, state: C_RUNNING, diagnostics:
"", exit_status: -1000, 
2012-11-07 05:36:33,774 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
Container container_1351873505780_75229_01_000004 transitioned from NEW to DONE
2012-11-07 05:36:33,774 [AsyncDispatcher event handler] WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: APPLICATION_CONTAINER_FINISHED
at INITING
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:404)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:60)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:570)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:562)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
        at java.lang.Thread.run(Thread.java:619)
2012-11-07 05:36:33,774 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
Application application_1351873505780_75229 transitioned from INITING to null
2012-11-07 05:36:33,775 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
Considering container container_1351873505780_75229_01_000004 for log-aggregation
2012-11-07 05:36:33,775 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
Application application_1351873505780_75229 transitioned from INITING to RUNNING
2012-11-07 05:36:33,775 [AsyncDispatcher event handler] WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
Can't handle this event at current state: Current: [DONE], eventType: [INIT_CONTAINER]
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: INIT_CONTAINER
at DONE
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:826)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:71)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:554)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:547)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
        at java.lang.Thread.run(Thread.java:619)
2012-11-07 05:36:33,775 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
Container container_1351873505780_75229_01_000004 transitioned from DONE to null


The fix should be to allow the CONTAINER_DONE_TRANSITION processing to occur from the INITING
state. This should remove the container from the list of containers the application is tracking
so that it finishes cleaning up when the application actually finishes. As it stands, the application
is going to think this container is still running and will continue renewing log aggregation
leases forever.
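
To make the proposed change concrete, here is a minimal, self-contained sketch of the idea. It does not use the real NM classes or StateMachineFactory: the class AppStateSketch, its State and Event enums, the replay helper, and the handleFinishedWhileIniting flag are all hypothetical stand-ins, and the event sequence is only an approximation of what the logs above show.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified stand-in for the NM application state machine. */
class AppStateSketch {
    enum State { NEW, INITING, RUNNING }
    enum Event { INIT_APPLICATION, INIT_CONTAINER, APPLICATION_INITED, APPLICATION_CONTAINER_FINISHED }

    private State state = State.NEW;
    private final Set<String> containers = new HashSet<>();
    // Assumption for illustration: toggles the proposed INITING transition on or off.
    private final boolean handleFinishedWhileIniting;

    AppStateSketch(boolean handleFinishedWhileIniting) {
        this.handleFinishedWhileIniting = handleFinishedWhileIniting;
    }

    State getState() { return state; }

    Set<String> trackedContainers() { return Collections.unmodifiableSet(containers); }

    void handle(Event event, String containerId) {
        switch (state) {
            case NEW:
                if (event == Event.INIT_APPLICATION) { state = State.INITING; return; }
                break;
            case INITING:
                if (event == Event.INIT_CONTAINER) { containers.add(containerId); return; }
                if (event == Event.APPLICATION_INITED) { state = State.RUNNING; return; }
                // Proposed fix: stop tracking a container that finished during init.
                if (event == Event.APPLICATION_CONTAINER_FINISHED && handleFinishedWhileIniting) {
                    containers.remove(containerId);
                    return;
                }
                break;
            case RUNNING:
                if (event == Event.INIT_CONTAINER) { containers.add(containerId); return; }
                if (event == Event.APPLICATION_CONTAINER_FINISHED) { containers.remove(containerId); return; }
                break;
        }
        // No transition matched: warn and drop the event, leaving the state unchanged.
        System.err.println("Can't handle this event at current state: " + event + " at " + state);
    }

    /** Replays an approximation of the logged sequence for one container. */
    static AppStateSketch replay(boolean withFix) {
        String id = "container_1351873505780_75229_01_000004";
        AppStateSketch app = new AppStateSketch(withFix);
        app.handle(Event.INIT_APPLICATION, null);
        app.handle(Event.INIT_CONTAINER, id);
        app.handle(Event.APPLICATION_CONTAINER_FINISHED, id); // container killed early
        app.handle(Event.APPLICATION_INITED, null);
        return app;
    }

    public static void main(String[] args) {
        System.out.println("without fix, still tracked: " + replay(false).trackedContainers());
        System.out.println("with fix, still tracked:    " + replay(true).trackedContainers());
    }
}
```

In this sketch, the replay without the extra transition reaches RUNNING with the killed container still in the tracked set, which is the leak described above. With the transition enabled, the set is empty by the time the application is running.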

                
> NM state machine ignores an APPLICATION_CONTAINER_FINISHED event when it shouldn't
> ----------------------------------------------------------------------------------
>
>                 Key: YARN-212
>                 URL: https://issues.apache.org/jira/browse/YARN-212
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.23.4
>            Reporter: Nathan Roberts
>            Assignee: Nathan Roberts
>            Priority: Blocker
>
> The NM state machines can make the following two invalid state transitions when a speculative
> attempt is killed shortly after it gets started. When this happens the NM keeps the log aggregation
> context open for this application and therefore chews up FDs and leases on the NN, eventually
> running the NN out of FDs and bringing down the entire cluster.
> 2012-11-07 05:36:33,774 [AsyncDispatcher event handler] WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
> Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: APPLICATION_CONTAINER_FINISHED
> at INITING
> 2012-11-07 05:36:33,775 [AsyncDispatcher event handler] WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
> Can't handle this event at current state: Current: [DONE], eventType: [INIT_CONTAINER]
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: INIT_CONTAINER
> at DONE

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
