hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Vinod Kumar Vavilapalli (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
Date Thu, 29 Nov 2012 09:22:58 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506353#comment-13506353
] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-4819:
----------------------------------------------------

Just read up the whole wall of comments. This has become quite messy, we didn't foresee some
of this during design, *sigh*.

+1 to the proposal of leaving job-history files behind till the staging directory cleanup.

Jason't last comment about _SUCCESS files is very valid. In fact there is a third problem,
one related to the JobEndNotification. Essentially, there are the three ways client learn
about final status - RPC call, _SUCCESS markers and JobEndNotification. We need to make sure
we handle consistency with all the three. But let's split the tasks and let this one take
care of the usual RPC calls. Let's debate the solution for other two issues separately? I
propose this to keep things saner, it has become quite unweildly already.
                
> AM can rerun job after reporting final job status to the client
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-4819
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Bikas Saha
>            Priority: Critical
>         Attachments: MAPREDUCE-4819.1.patch
>
>
> If the AM reports final job status to the client but then crashes before unregistering
with the RM then the RM can run another AM attempt.  Currently AM re-attempts assume that
the previous attempts did not reach a final job state, and that causes the job to rerun (from
scratch, if the output format doesn't support recovery).
> Re-running the job when we've already told the client the final status of the job is
bad for a number of reasons.  If the job failed, it's confusing at best since the client was
already told the job failed but the subsequent attempt could succeed.  If the job succeeded
there could be data loss, as a subsequent job launched by the client tries to consume the
job's output as input just as the re-attempt starts removing output files in preparation for
the output commit.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message