hadoop-yarn-issues mailing list archives

From "Jian He (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-933) Potential InvalidStateTransitonException: Invalid event: LAUNCHED at FINAL_SAVING
Date Tue, 10 Feb 2015 00:51:35 GMT

    [ https://issues.apache.org/jira/browse/YARN-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313286#comment-14313286
] 

Jian He commented on YARN-933:
------------------------------

bq. You should not ignore RMAppAttemptEventType.LAUNCHED? We will have to explicitly kill
the AppAttempt and the AM in this case
The AM is already being killed here. The ALLOCATED state receives the kill event, kills the AM
(by sending the cleanup event to the AM launcher), and then moves to the FINAL_SAVING state.
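
To illustrate the point, here is a minimal, self-contained sketch of the transitions described above. It is not the actual RMAppAttemptImpl code and does not use Hadoop's StateMachineFactory; the class name AttemptStateSketch, its enums, and the cleanupRequested flag are hypothetical and exist only for illustration. It assumes that a LAUNCHED event arriving late in FINAL_SAVING can be dropped safely because the cleanup request has already been sent.

```java
public class AttemptStateSketch {
    // Hypothetical, reduced set of states and events; the real attempt has many more.
    public enum State { ALLOCATED, LAUNCHED, FINAL_SAVING }
    public enum Event { KILL, LAUNCHED }

    private State state = State.ALLOCATED;
    // Stands in for "send the cleanup event to the AM launcher".
    private boolean cleanupRequested = false;

    public State getState() { return state; }
    public boolean isCleanupRequested() { return cleanupRequested; }

    public State handle(Event event) {
        switch (state) {
            case ALLOCATED:
                if (event == Event.KILL) {
                    // Kill the AM first, then move on to saving the final state.
                    cleanupRequested = true;
                    state = State.FINAL_SAVING;
                } else {
                    // The only other event in this sketch is LAUNCHED.
                    state = State.LAUNCHED;
                }
                break;
            case FINAL_SAVING:
                if (event == Event.LAUNCHED) {
                    // Late LAUNCHED from the launcher: ignored, the AM is already
                    // being cleaned up, so no state change and no exception.
                    break;
                }
                throw new IllegalStateException("Invalid event: " + event + " at " + state);
            default:
                throw new IllegalStateException("Invalid event: " + event + " at " + state);
        }
        return state;
    }

    public static void main(String[] args) {
        AttemptStateSketch attempt = new AttemptStateSketch();
        attempt.handle(Event.KILL);      // ALLOCATED -> FINAL_SAVING, cleanup requested
        attempt.handle(Event.LAUNCHED);  // ignored in FINAL_SAVING
        System.out.println(attempt.getState() + " cleanupRequested=" + attempt.isCleanupRequested());
    }
}
```

In this sketch, ignoring LAUNCHED in FINAL_SAVING does not leave a running AM behind, because the cleanup was requested as part of the ALLOCATED to FINAL_SAVING transition.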

> Potential InvalidStateTransitonException: Invalid event: LAUNCHED at FINAL_SAVING
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-933
>                 URL: https://issues.apache.org/jira/browse/YARN-933
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.0.5-alpha
>            Reporter: J.Andreina
>            Assignee: Rohith
>         Attachments: 0001-YARN-933.patch, 0001-YARN-933.patch, YARN-933.3.patch, YARN-933.patch
>
>
> AM max retries is configured as 3 on both the client and RM sides.
> Step 1: Install a cluster with NMs on 2 machines.
> Step 2: Make ping by IP from the RM machine to the NM1 machine succeed, while ping by hostname should fail.
> Step 3: Execute a job.
> Step 4: After the AM [AppAttempt_1] is allocated to the NM1 machine, a connection loss occurs.
> Observation:
> ==========
> After AppAttempt_1 has moved to the failed state, the release of the container for AppAttempt_1 and the application removal succeed. A new AppAttempt_2 is spawned.
> 1. Then a retry for AppAttempt_1 happens again.
> 2. The RM again tries to launch AppAttempt_1, which fails with InvalidStateTransitonException.
> 3. The client exited after AppAttempt_1 finished [but the job is actually still running], although the number of app attempts configured is 3 and the remaining app attempts are all spawned and running.
> RMLogs:
> ======
> 2013-07-17 16:22:51,013 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
appattempt_1373952096466_0056_000001 State change from SCHEDULED to ALLOCATED
> 2013-07-17 16:35:48,171 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
host-10-18-40-15/10.18.40.59:8048. Already tried 36 time(s); maxRetries=45
> 2013-07-17 16:36:07,091 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:container_1373952096466_0056_01_000001
Timed out after 600 secs
> 2013-07-17 16:36:07,093 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_1373952096466_0056_01_000001 Container Transitioned from ACQUIRED to EXPIRED
> 2013-07-17 16:36:07,093 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService:
Registering appattempt_1373952096466_0056_000002
> 2013-07-17 16:36:07,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Application appattempt_1373952096466_0056_000001 is done. finalState=FAILED
> 2013-07-17 16:36:07,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
Application removed - appId: application_1373952096466_0056 user: Rex leaf-queue of parent:
root #applications: 35
> 2013-07-17 16:36:07,132 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Application Submission: appattempt_1373952096466_0056_000002, 
> 2013-07-17 16:36:07,138 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
appattempt_1373952096466_0056_000002 State change from SUBMITTED to SCHEDULED
> 2013-07-17 16:36:30,179 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
host-10-18-40-15/10.18.40.59:8048. Already tried 38 time(s); maxRetries=45
> 2013-07-17 16:38:36,203 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
host-10-18-40-15/10.18.40.59:8048. Already tried 44 time(s); maxRetries=45
> 2013-07-17 16:38:56,207 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher:
Error launching appattempt_1373952096466_0056_000001. Got exception: java.lang.reflect.UndeclaredThrowableException
> 2013-07-17 16:38:56,207 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: LAUNCH_FAILED
at FAILED
>  at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>  at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>  at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:445)
>  at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:630)
>  at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:99)
>  at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:495)
>  at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:476)
>  at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
>  at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
>  at java.lang.Thread.run(Thread.java:662)
> Client Logs
> ========
> Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while
waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending
remote=host-10-18-40-15/10.18.40.59:8020]
>  at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:573)
>  at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
> 2013-07-17 16:37:05,987 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException
as:Rex (auth:SIMPLE) cause:org.apache.hadoop.net.ConnectTimeoutException: Call From HOST-10-18-91-55/10.18.40.57
to host-10-18-40-15:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException:
20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending
remote=host-10-18-40-15/10.18.40.59:8020]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
