hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-5471) Succeed job tries to restart after RMrestart
Date Wed, 04 Sep 2013 14:55:57 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757831#comment-13757831 ]

Jason Lowe commented on MAPREDUCE-5471:
---------------------------------------

bq. If the MR AM scans staging directory, followed by intermediate directory and the done
directory before starting the new AM's jobs, we should be good. The tricky part is to locate
the file in the done directory. Let's see if we can do that.

I missed the offline discussions, so apologies if this is all stuff you have hashed over before.
IMHO this issue can be resolved by not having the RM return from the app unregister call
until the persistent store has confirmed that it has removed the app state.  Otherwise the RM
can return control to the AM, which will proceed to remove its staging directory, and if the
app is then restarted it will fail.  In other words, we cannot fix this completely in the MR AM
with a subsequent attempt, because it is highly likely that the subsequent attempt cannot be
launched successfully.
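
A minimal sketch of the ordering suggested above, assuming an asynchronous state store. The
classes SketchStateStore and SketchResourceManager and their methods are hypothetical
stand-ins for illustration, not YARN classes or the actual RM code path. The point is only
that unregister() does not return until the removal has been confirmed.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the RM's persistent state store (not a YARN class).
class SketchStateStore {
    private final Set<String> apps = ConcurrentHashMap.newKeySet();

    void storeApp(String appId) {
        apps.add(appId);
    }

    boolean contains(String appId) {
        return apps.contains(appId);
    }

    // Removal runs asynchronously; the future completes once the entry is gone.
    CompletableFuture<Void> removeAppAsync(String appId) {
        return CompletableFuture.runAsync(() -> apps.remove(appId));
    }
}

// Hypothetical stand-in for the RM side of the app unregister call.
class SketchResourceManager {
    private final SketchStateStore store;

    SketchResourceManager(SketchStateStore store) {
        this.store = store;
    }

    // Blocks until the store confirms removal. Only after this returns may the
    // AM delete its staging directory, so an RM restart can no longer find
    // recoverable app state that points at a missing staging directory.
    void unregister(String appId) {
        store.removeAppAsync(appId).join();
    }
}
```

With this ordering the AM's cleanup always happens after the app state is gone, so a
restarted RM has nothing left to relaunch.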

As for trying to protect the MR AM from a re-run after success, isn't this already a solved
problem?  See MAPREDUCE-4819.  The MR AM will check for prior success in the staging directory
and avoid re-running the job if that's there.  I don't think we need to go scraping through
the done_intermediate and done directories, since the history file is copied, not moved, to the
done_intermediate directory.  And if the staging directory is missing then the subsequent
AM attempt isn't going to start anyway.
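
To illustrate the check described above, here is a rough sketch using the local filesystem.
The class PriorSuccessCheck, the Decision enum and the marker file name are assumptions made
for illustration only; this is not the code from MAPREDUCE-4819 and it does not use the
Hadoop FileSystem API.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper showing the decision a new AM attempt could make at startup.
class PriorSuccessCheck {
    // Assumed marker name, chosen for this sketch only.
    static final String SUCCESS_MARKER = "COMMIT_SUCCESS";

    enum Decision { RUN_JOB, SKIP_ALREADY_SUCCEEDED, CANNOT_START }

    static Decision decide(Path stagingDir) {
        // Without the staging directory the attempt has no job files to start from.
        if (!Files.isDirectory(stagingDir)) {
            return Decision.CANNOT_START;
        }
        // A success marker left by a previous attempt means the job must not re-run.
        if (Files.exists(stagingDir.resolve(SUCCESS_MARKER))) {
            return Decision.SKIP_ALREADY_SUCCEEDED;
        }
        return Decision.RUN_JOB;
    }
}
```

The three outcomes mirror the argument above: a marker in the staging directory is enough to
avoid a re-run, and a missing staging directory means the attempt cannot start regardless of
what the done directories contain.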
                
> Succeed job tries to restart after RMrestart
> --------------------------------------------
>
>                 Key: MAPREDUCE-5471
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5471
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: yeshavora
>            Assignee: Jian He
>            Priority: Blocker
>         Attachments: MR5471-1AM.log, MR5471-2AM.log
>
>
> Run a job and restart the RM just as the job finishes. The job should not be restarted once it has succeeded.
> After the RM restart, the AM of the restarted job fails with the error below.
> AM log after RM restart:
> 013-08-19 17:29:21,144 INFO [main] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler. Size of the outstanding queue size is 0
> 2013-08-19 17:29:21,145 INFO [main] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopped JobHistoryEventHandler. super.stop()
> 2013-08-19 17:29:21,146 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Deleting staging directory hdfs://host1:port1/user/ABC/.staging/job_1376933101704_0001
> 2013-08-19 17:29:21,156 FATAL [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.FileNotFoundException: File does not exist: hdfs://host1:port1/ABC/.staging/job_1376933101704_0001/job.splitmetainfo
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.createSplits(JobImpl.java:1469)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.transition(JobImpl.java:1324)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.transition(JobImpl.java:1291)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:922)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:131)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1184)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStart(MRAppMaster.java:995)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.run(MRAppMaster.java:1394)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1477)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1390)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1323)
> Caused by: java.io.FileNotFoundException: File does not exist: hdfs://host1:port1/ABC/.staging/job_1376933101704_0001/job.splitmetainfo
>         at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1121)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1113)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:78)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1113)
>         at org.apache.hadoop.mapreduce.split.SplitMetaInfoReader.readSplitMetaInfo(SplitMetaInfoReader.java:51)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.createSplits(JobImpl.java:1464)
>         ... 17 more
> 2013-08-19 17:29:21,158 INFO [Thread-2] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a signal. Signaling RMCommunicator and JobHistoryEventHandler.
> 2013-08-19 17:29:21,159 WARN [Thread-2] org.apache.hadoop.util.ShutdownHookManager: ShutdownHook 'MRAppMasterShutdownHook' failed, java.lang.NullPointerException
> java.lang.NullPointerException
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter.setSignalled(MRAppMaster.java:805)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$MRAppMasterShutdownHook.run(MRAppMaster.java:1344)
>         at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)

