hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Wanqiang Ji (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-7160) Job end notification not sent and client service not stopped after YarnRuntimeException in shutDownJob in MRAppMaster.java despite finally block
Date Wed, 05 Jun 2019 02:51:00 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856279#comment-16856279
] 

Wanqiang Ji commented on MAPREDUCE-7160:
----------------------------------------

Hi [~johnthomason], thanks for filling it. I think you need to submit the patch.

> Job end notification not sent and client service not stopped after YarnRuntimeException
in shutDownJob in MRAppMaster.java despite finally block
> ------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-7160
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7160
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 3.0.3
>            Reporter: John Thomason
>            Priority: Major
>
> If a YarnRuntimeException occurs during shutDownJob (in this case, at the line 'MRAppMaster.this.stop();')
in MRAppMaster.java, the try/catch block will catch it but the 'finally' block will not get
executed. Additionally, the client service should be stopped if an exception occurs and is
not currently.
> For example, if this error occurs:
> {code:java}
> 2018-11-20 12:38:46,173 INFO [Thread-171] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler:
In stop, writing event TASK_FINISHED 
> 2018-11-20 12:38:46,173 ERROR [Thread-171] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler:
Error writing History Event: org.apache.hadoop.mapreduce.jobhistory.TaskFinishedEvent@3b1a7bb5
java.io.IOException: All datanodes [DatanodeInfoWithStorage[10.11.1.227:9866,DS-3539e7b8-5d87-45e5-880a-1897f11577d2,DISK]]
are bad. Aborting... 
> at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) 2018-11-20 12:38:46,174
INFO [Thread-171] org.apache.hadoop.service.AbstractService: Service JobHistoryEventHandler
failed in state STOPPED
> {code}
> Then the line 
> {code:java}
> catch (Throwable t) { 
>    LOG.warn("Graceful stop failed. Exiting.. ", t); 
>    exitMRAppMaster(1, t); 
> }
> {code}
> causes the MRAppMaster to exit before the following finally block can execute. This
means that the job end notification is not sent despite [https://jira.apache.org/jira/browse/MAPREDUCE-6895?attachmentOrder=desc]. 
> Additionally, this finally block should also call 'clientService.stop();', as otherwise
the following errors occur on the client side:
> {code:java}
> 2018-11-07 19:50:59,602 INFO mapreduce.Job: map 100% reduce 98% 
> 2018-11-07 19:51:03,617 INFO mapreduce.Job: map 100% reduce 99% 
> 2018-11-07 19:51:07,648 INFO mapreduce.Job: map 100% reduce 100% 
> 2018-11-07 19:51:07,656 INFO mapreduce.Job: Job job_1541647829228_0001 completed successfully

> 2018-11-07 19:51:08,162 INFO mapred.ClientServiceDelegate: Application state is completed.
FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 
> 2018-11-07 19:51:09,169 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:10020.
Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS) 2018-11-07 19:51:10,170 INFO ipc.Client: Retrying connect to
server: 0.0.0.0/0.0.0.0:10020. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS) 2018-11-07 19:51:11,171 INFO ipc.Client: Retrying connect to
server: 0.0.0.0/0.0.0.0:10020. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
> ......
> 2018-11-07 19:51:38,407 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:10020.
Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS) java.io.IOException: java.net.ConnectException: Your endpoint
configuration is wrong; For more details see: http://wiki.apache.org/hadoop/UnsetHostnameOrPort
at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:344) at
org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:382)
at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:859) at org.apache.hadoop.mapreduce.Job$8.run(Job.java:820)
at org.apache.hadoop.mapreduce.Job$8.run(Job.java:817) at java.security.AccessController.doPrivileged(Native
Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1686)
at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:817) at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1663)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1591) at org.apache.hadoop.examples.terasort.TeraSort.run(TeraSort.java:334)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at org.apache.hadoop.examples.terasort.TeraSort.main(TeraSort.java:343)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at
java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144) at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at
java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.util.RunJar.run(RunJar.java:244)
at org.apache.hadoop.util.RunJar.main(RunJar.java:158)
> {code}
>  
> Removing the finally block and adding calls to send the job end notification and stop
the client service prior to 'exitMRAppMaster' within the catch block, such as something similar
to below, fixes the issue:
> {code:java}
> catch (Throwable t) {
>    LOG.warn("Graceful stop failed. Exiting.. ", t);
>    if (isLastAMRetry && notifier != null) {
>       sendJobEndNotify(notifier);
>    }
>    try {
>       Thread.sleep(5000);
>    } catch (InterruptedException e) {
>       e.printStackTrace();
>    }
>    clientService.stop();
>    exitMRAppMaster(1, t);
> }
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-help@hadoop.apache.org


Mime
View raw message