hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hudson (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
Date Thu, 01 Dec 2011 08:53:40 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160749#comment-13160749
] 

Hudson commented on MAPREDUCE-3463:
-----------------------------------

Integrated in Hadoop-Hdfs-0.23-Commit #237 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/237/])
    MAPREDUCE-3463. Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job. (Siddharth Seth via mahadev) - Merging r1208994 from trunk

mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1208995
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java

                
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException
causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MR3463_v1.txt, MR3463_v2.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
> First Ran Randowriter/Sort/Sort-validate successfully
> Then again sort, when job was 50% complete
> Login node running AppMaster, and killed AppMaster with kill -9
> On Client side failed with following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for
job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed.
FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM
Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and
the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job
job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> On lookig RM logs found second AM was also lauched, it was saying -:
> {code}
> 011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> Now looking at AM logs and found Second AM was shutdown gracefully due to :-
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService:
Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher:
Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster]
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message