hadoop-mapreduce-issues mailing list archives

From "Hadoop QA (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3463) Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
Date Sun, 27 Nov 2011 23:48:40 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158095#comment-13158095 ]

Hadoop QA commented on MAPREDUCE-3463:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12504978/MR3463_v1.txt
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    -1 javac.  The patch appears to cause tar ant target to fail.

    -1 findbugs.  The patch appears to cause Findbugs (version 1.3.9) to fail.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed the unit tests build

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1346//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1346//console

This message is automatically generated.
                
> Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3463
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3463
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>         Attachments: MR3463_v1.txt
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml and started a 4-node YARN cluster.
> First ran RandomWriter/Sort/Sort-validate successfully.
> Then ran Sort again. When the job was 50% complete,
> logged in to the node running the AppMaster and killed the AppMaster with kill -9.
> The client side then failed with the following:
> {code}
> 11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
> 11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
> 11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
> 11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
> 11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
> 11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
> RemoteTrace: 
>  at Local Trace: 
> 	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
> 	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
> 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
> 	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
> 	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
> 	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
> 	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
> 	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> 	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
> {code}
> The RM logs show that a second AM was also launched. They report:
> {code}
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
> 2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
> {code}
> The AM logs show that the second AM shut down gracefully because of the following error:
> {code}
> 2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
> 2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
> java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
>         at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
>         at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:619)
> 2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
> {code}
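
The stack trace above shows ConverterUtils.toNodeId rejecting a node id string that carries only a hostname. The sketch below is not Hadoop's implementation. The class NodeIdParseSketch, the holder HostPort, and the methods parseNodeId and isValidNodeId are hypothetical names chosen for illustration, and the host name and port are made-up sample values. It only illustrates the kind of host:port check that the error message implies, and why a recovered hostname without a port would fail it.

```java
public class NodeIdParseSketch {

    /** Hypothetical holder for a parsed node id; this is not Hadoop's NodeId class. */
    static final class HostPort {
        final String host;
        final int port;

        HostPort(String host, int port) {
            this.host = host;
            this.port = port;
        }
    }

    /**
     * Parses a string of the form host:port.
     * Assumption: exactly one ':' separates a non-empty host from a numeric port.
     * The error text mirrors the message seen in the AM log.
     */
    static HostPort parseNodeId(String nodeIdStr) {
        String[] parts = nodeIdStr.split(":");
        if (parts.length != 2 || parts[0].isEmpty()) {
            throw new IllegalArgumentException(
                "Invalid NodeId [" + nodeIdStr + "]. Expected host:port");
        }
        // Integer.parseInt throws NumberFormatException, a subclass of
        // IllegalArgumentException, when the port is not numeric.
        return new HostPort(parts[0], Integer.parseInt(parts[1]));
    }

    /** Returns true when the string parses as host:port, false otherwise. */
    static boolean isValidNodeId(String nodeIdStr) {
        try {
            parseNodeId(nodeIdStr);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Host and port together are accepted.
        System.out.println(isValidNodeId("nm-host.example.com:45454")); // prints true
        // A bare hostname, as in the log above, is rejected.
        System.out.println(isValidNodeId("nm-host.example.com"));       // prints false
    }
}
```

If the recovery path only has a hostname available, a check of this shape fails for every recovered attempt, which is consistent with the dispatcher exiting on the first assigned event in the log above.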

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
