hadoop-mapreduce-issues mailing list archives

From "Robert Joseph Evans (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
Date Wed, 02 Jan 2013 15:00:19 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Joseph Evans updated MAPREDUCE-4819:
-------------------------------------------

    Attachment: MR-4819-bobby-trunk.txt

Bikas,

I would like to propose an alternative fix, and I am attaching a very preliminary patch.
Instead, it puts a "lock" around the job commit by adding a few new files to the staging
directory.  Task commits would still be required to handle the rare possibility of a double
commit, just as is possible in 1.0 now.  We would make a double commit just as likely to
happen as it is in 1.0 by also putting in MAPREDUCE-4832, which would help to ensure that we
don't have two AMs telling tasks to do things at the same time.
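
To make the "lock" idea concrete, here is a minimal sketch of how marker files in a staging
directory could guard a job commit.  It is not the attached patch.  The class name, the
marker file names (COMMIT_STARTED, COMMIT_SUCCESS, COMMIT_FAIL) and the Committer interface
are all assumptions made for illustration, and it uses java.nio.file on a local directory
rather than the Hadoop FileSystem API.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch only: all names here are hypothetical, not taken from the patch.
final class CommitLockSketch {
    enum State { NOT_STARTED, SUCCEEDED, FAILED, INTERRUPTED }

    // Hypothetical marker file names written into the staging directory.
    static final String STARTED = "COMMIT_STARTED";
    static final String SUCCESS = "COMMIT_SUCCESS";
    static final String FAIL = "COMMIT_FAIL";

    // Stand-in for whatever actually commits the job output.
    interface Committer { void commit() throws IOException; }

    // What a new AM attempt could learn from the markers left by earlier attempts.
    static State inspect(Path stagingDir) {
        if (Files.exists(stagingDir.resolve(SUCCESS))) return State.SUCCEEDED;
        if (Files.exists(stagingDir.resolve(FAIL))) return State.FAILED;
        if (Files.exists(stagingDir.resolve(STARTED))) return State.INTERRUPTED;
        return State.NOT_STARTED;
    }

    // Run the job commit only if no earlier attempt has begun it.
    static State commitOnce(Path stagingDir, Committer committer) throws IOException {
        State prior = inspect(stagingDir);
        if (prior != State.NOT_STARTED) {
            return prior; // an earlier attempt got this far; do not rerun the commit
        }
        try {
            // createFile fails if the file exists, so only one caller passes this point.
            Files.createFile(stagingDir.resolve(STARTED));
        } catch (FileAlreadyExistsException e) {
            return inspect(stagingDir); // another attempt took the lock first
        }
        try {
            committer.commit();
        } catch (IOException e) {
            Files.createFile(stagingDir.resolve(FAIL));
            return State.FAILED;
        }
        Files.createFile(stagingDir.resolve(SUCCESS));
        return State.SUCCEEDED;
    }

    public static void main(String[] args) throws IOException {
        Path staging = Files.createTempDirectory("staging");
        System.out.println(commitOnce(staging, () -> System.out.println("committing")));
        // A second attempt sees the markers and reports the earlier outcome instead.
        System.out.println(commitOnce(staging, () -> System.out.println("committing")));
    }
}
```

In this sketch an attempt that finds only the "started" marker reports INTERRUPTED rather
than guessing at the outcome; how a real AM should react in that case is a design decision
this example leaves open.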

I would appreciate any feedback on this approach.  I am going to be working to add in more
tests and clean up the code.
                
> AM can rerun job after reporting final job status to the client
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-4819
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Bikas Saha
>            Priority: Critical
>         Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt
>
>
> If the AM reports final job status to the client but then crashes before unregistering with the RM then the RM can run another AM attempt.  Currently AM re-attempts assume that the previous attempts did not reach a final job state, and that causes the job to rerun (from scratch, if the output format doesn't support recovery).
> Re-running the job when we've already told the client the final status of the job is bad for a number of reasons.  If the job failed, it's confusing at best since the client was already told the job failed but the subsequent attempt could succeed.  If the job succeeded there could be data loss, as a subsequent job launched by the client tries to consume the job's output as input just as the re-attempt starts removing output files in preparation for the output commit.

