hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Koji Noguchi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
Date Tue, 27 Nov 2012 16:55:59 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504758#comment-13504758

Koji Noguchi commented on MAPREDUCE-4819:

bq. like the client never being notified at all because the AM crashes after unregistering
with the RM but before it notifies the client.

As long as client eventually fail, that's not a problem.

Critical problem we have here is false-positive from the client's perspective.
Client is getting 'success' but output is incomplete or corrupt(due to retried application/job
(over)writing to the same target path.)

If we can have AM and RM to agree on the job status before telling the client, I think that
would work.  There could be a corner case when AM and RM say the job was successful but client
thinks it failed. This false-negative is much better than false-positive issue we have now.
 Even in 0.20, we had cases when JobTracker reports job was successful but client thinks it
failed due to communication failure to the JobTracker.  This is fine to happen and we should
let the client handle the recovery-or-retry.

bq. In general it seems like we need to come up with a set of markers that previous AM's leave

I don't want the correctness of the job to depend on the marker on hdfs.

> AM can rerun job after reporting final job status to the client
> ---------------------------------------------------------------
>                 Key: MAPREDUCE-4819
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Bikas Saha
>            Priority: Critical
> If the AM reports final job status to the client but then crashes before unregistering
with the RM then the RM can run another AM attempt.  Currently AM re-attempts assume that
the previous attempts did not reach a final job state, and that causes the job to rerun (from
scratch, if the output format doesn't support recovery).
> Re-running the job when we've already told the client the final status of the job is
bad for a number of reasons.  If the job failed, it's confusing at best since the client was
already told the job failed but the subsequent attempt could succeed.  If the job succeeded
there could be data loss, as a subsequent job launched by the client tries to consume the
job's output as input just as the re-attempt starts removing output files in preparation for
the output commit.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

View raw message