hadoop-mapreduce-issues mailing list archives

From "Bikas Saha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
Date Thu, 29 Nov 2012 17:01:00 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506589#comment-13506589 ]

Bikas Saha commented on MAPREDUCE-4819:
---------------------------------------

Attaching a patch based on the above suggestions of keeping the temp data around. The temp
data was actually being stored in the global staging dir and not the job-specific staging
dir, so it wasn't being deleted upon successful completion. I changed it so that it's stored
inside the job staging directory, so all temp history goes away after the last successful
job or the last retry.
Added a test and also verified manually, by hacking a System.exit() into MRAppMaster.shutdownJob(),
that the following works: an AM dies after reporting finished state but before unregistering;
it is restarted, and the new AM exits with success after registering and unregistering with
the RM.

As far as YARN-244 is concerned, the comments around the code seem to suggest that it was an
explicit decision to clean up before unregistering.
{noformat}
    // Add the staging directory cleaner before the history server but after
    // the container allocator so the staging directory is cleaned after
    // the history has been flushed but before unregistering with the RM.
    addService(createStagingDirCleaningService());
{noformat}

This patch addresses the issue for this jira: make sure a successfully completed job is
not rerun if the AM is retried, pending a solution to YARN-244. It's safe because once the
staging dir is cleaned up the next attempt cannot run, so it's a fail-stop.
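
To make the intended behaviour concrete, here is a minimal sketch of the decision a restarted
AM attempt has to make. It is not the code in the patch: the class, the enum and the marker
file name are all hypothetical, and a plain local directory stands in for the job staging
directory on the cluster file system. The marker file stands in for the temp history data
that a previous attempt leaves behind once it has reached a final state.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only; none of these names come from the Hadoop code base.
class AttemptRecoverySketch {

    // What a new AM attempt should do when it starts.
    enum Action { RUN_JOB, UNREGISTER_ONLY, FAIL_STOP }

    // Assumed name for the record a previous attempt leaves in the job staging dir.
    static final String FINAL_STATE_MARKER = "final-state";

    // Called by an attempt once it has reported a final job state to the client.
    static void recordFinalState(Path jobStagingDir, String state) throws IOException {
        Files.write(jobStagingDir.resolve(FINAL_STATE_MARKER),
                state.getBytes(StandardCharsets.UTF_8));
    }

    // Called by a (re)started attempt before doing any work.
    static Action decide(Path jobStagingDir) {
        if (!Files.isDirectory(jobStagingDir)) {
            // Staging dir already cleaned up: nothing to run from, so stop.
            return Action.FAIL_STOP;
        }
        if (Files.exists(jobStagingDir.resolve(FINAL_STATE_MARKER))) {
            // A previous attempt already finished: register and unregister, do not rerun.
            return Action.UNREGISTER_ONLY;
        }
        // No earlier attempt reached a final state: run (or recover) the job.
        return Action.RUN_JOB;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("job-staging");
        System.out.println("fresh attempt: " + decide(dir));
        recordFinalState(dir, "SUCCEEDED");
        System.out.println("after final state recorded: " + decide(dir));
        Files.delete(dir.resolve(FINAL_STATE_MARKER));
        Files.delete(dir);
        System.out.println("after staging dir cleanup: " + decide(dir));
    }
}
```

Because the record lives inside the job staging directory, cleaning up that directory removes
it too, which is why the third case ends in a fail-stop rather than a rerun.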
                
> AM can rerun job after reporting final job status to the client
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-4819
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Bikas Saha
>            Priority: Critical
>         Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch
>
>
> If the AM reports final job status to the client but then crashes before unregistering
with the RM then the RM can run another AM attempt.  Currently AM re-attempts assume that
the previous attempts did not reach a final job state, and that causes the job to rerun (from
scratch, if the output format doesn't support recovery).
> Re-running the job when we've already told the client the final status of the job is
bad for a number of reasons.  If the job failed, it's confusing at best since the client was
already told the job failed but the subsequent attempt could succeed.  If the job succeeded
there could be data loss, as a subsequent job launched by the client tries to consume the
job's output as input just as the re-attempt starts removing output files in preparation for
the output commit.
