hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
Date Wed, 28 Nov 2012 21:49:58 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505935#comment-13505935

Jason Lowe commented on MAPREDUCE-4819:

Took a look at the patch, and I think we are missing some critical corner cases.  For example,
if we finish committing the job and the committer is using a marker of sorts (e.g.: _SUCCESS),
then we could trigger downstream jobs to run *before* the job history is completely closed.
 I believe Oozie is polling for the _SUCCESS marker, for example.  If we crash after committing
but before writing the job finished record then we could end up re-committing again while
another job is attempting to consume our output, leading to potential data loss even though
both jobs would have "SUCCEEDED".  That's a Bad Thing.

I think the crux of the issue is that we must not commit twice.  The act of committing is
what could trigger downstream jobs or in itself not be repeatable/recoverable, so we should
treat AM crashes during job commit much like we treat non-crashing failures during job commit
today, i.e.: it should fail the job without re-running and re-committing.  Worst-case we have
a false negative where the output did commit successfully but we thought the job failed, and
I agree with Koji that a false negative beats a false positive in this case.

This means we need a marker noting when we start and stop committing sync'd to the job history
file.  If the AM relaunches and finds we crashed during commit, we should treat it as we do
a committer failure and fail the job.  If the re-attempt finds we finished committing then
we simply need to unregister from the RM without re-running.
> AM can rerun job after reporting final job status to the client
> ---------------------------------------------------------------
>                 Key: MAPREDUCE-4819
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Bikas Saha
>            Priority: Critical
>         Attachments: MAPREDUCE-4819.1.patch
> If the AM reports final job status to the client but then crashes before unregistering
with the RM then the RM can run another AM attempt.  Currently AM re-attempts assume that
the previous attempts did not reach a final job state, and that causes the job to rerun (from
scratch, if the output format doesn't support recovery).
> Re-running the job when we've already told the client the final status of the job is
bad for a number of reasons.  If the job failed, it's confusing at best since the client was
already told the job failed but the subsequent attempt could succeed.  If the job succeeded
there could be data loss, as a subsequent job launched by the client tries to consume the
job's output as input just as the re-attempt starts removing output files in preparation for
the output commit.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

View raw message