hadoop-mapreduce-issues mailing list archives

From "Robert Joseph Evans (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
Date Tue, 27 Nov 2012 17:31:58 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504784#comment-13504784 ]

Robert Joseph Evans commented on MAPREDUCE-4819:
------------------------------------------------

We are informing several different actors of "success/failure" in many different ways.

# _SUCCESS file being written to HDFS by the output committer as part of commitJob()
# job end notification by hitting an http server
# client being informed through RPC
# history server being informed by placing the log in a directory it can see
# resource manager being informed that the application is done

Some of these are much more important to report than others, but either way we still have
at a minimum two different things that need to be tied together: the commitJob and informing
the RM not to run us again.  Rearranging the order of them will not fix the fact that after
commitJob() finishes there is the possibility that something will fail, and that failure must
not fail the job.  We really need to have a two-phase commit in the job history file.

I am about to commit the job output.
commitJob()
I finished committing the job output successfully. 
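
The three steps above can be sketched roughly as follows. This is only an illustration, not the actual job history code: the class CommitJournal, the marker strings, and the in-memory event list are all hypothetical stand-ins for events written to the job history file.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names come from the Hadoop code base. */
class CommitJournal {
    // Hypothetical marker strings standing in for job history events.
    static final String COMMIT_STARTED = "JOB_COMMIT_STARTED";
    static final String COMMIT_COMPLETED = "JOB_COMMIT_COMPLETED";

    enum CommitState { NOT_STARTED, IN_PROGRESS, COMPLETED }

    // In-memory stand-in for the job history file.
    private final List<String> events = new ArrayList<>();

    /** Brackets the commit with a "started" and a "completed" marker. */
    void commitWithMarkers(Runnable commitJob) {
        events.add(COMMIT_STARTED);    // "I am about to commit the job output."
        commitJob.run();               // stands in for the committer's commitJob()
        events.add(COMMIT_COMPLETED);  // "I finished committing the job output successfully."
    }

    List<String> events() {
        return events;
    }

    /** What a later app attempt could infer from the previous attempt's events. */
    static CommitState stateOf(List<String> previousEvents) {
        if (previousEvents.contains(COMMIT_COMPLETED)) {
            return CommitState.COMPLETED;
        }
        if (previousEvents.contains(COMMIT_STARTED)) {
            return CommitState.IN_PROGRESS;
        }
        return CommitState.NOT_STARTED;
    }
}
```

If the commit throws or the AM dies in the middle, only the first marker is recorded, so a later attempt sees IN_PROGRESS and knows it cannot blindly call commitJob() again. What the attempt should do in that state is a policy decision that this sketch leaves open.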

Without this there will always be the possibility that commitJob will be called twice, which
would result in changes to the output directory. I would also argue that some of these are
important enough that we should consider reporting them twice and updating the listener to
handle double reporting, for example informing the history server that the job has finished.
For others, like job end notification or client RPC, it is not so critical.
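
As a minimal sketch of what "handling double reporting" could look like on the listener side, assuming the listener can key reports by a job id: the class DedupingJobEndListener and its method are hypothetical and do not correspond to any existing history server interface.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical listener that tolerates the same job being reported twice. */
class DedupingJobEndListener {
    // Job ids for which a final status has already been processed.
    private final Set<String> finishedJobs = ConcurrentHashMap.newKeySet();

    /**
     * Returns true if this is the first report for the job and it was processed,
     * false if the job was already reported and the duplicate was ignored.
     */
    boolean onJobFinished(String jobId) {
        // Set.add returns false when the element is already present.
        return finishedJobs.add(jobId);
    }
}
```

With a listener like this, the AM can safely report again after a restart, because the second report becomes a no-op.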

Koji,

I get that we want to reduce the risk of a user shooting themselves in the foot, but the file
must be stored in a user-accessible location because the entire job is run as the user.  It
is stored under the .staging directory; if the user deletes that directory, it will already
cause many other problems and probably cause the job to fail.  We can try to set it up so that
if the previous job history file does not exist on any app attempt other than the first one,
we fail fast.  That would prevent retries from messing up the output directory.
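
The fail-fast check could look something like the sketch below. It is only an assumption about the shape of the check: RecoveryGuard and its parameters are hypothetical, and java.nio.file is used as a local stand-in for the staging directory on HDFS.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical guard run at the start of each app attempt. */
class RecoveryGuard {
    /**
     * Throws if this is not the first attempt and the previous attempt's
     * job history file is missing, so the retry never touches the output directory.
     */
    static void failFastIfHistoryMissing(int attemptNumber, Path previousHistoryFile) {
        if (attemptNumber > 1 && !Files.exists(previousHistoryFile)) {
            throw new IllegalStateException(
                "Attempt " + attemptNumber + " found no previous job history file at "
                + previousHistoryFile + "; refusing to rerun the job");
        }
    }
}
```

The first attempt always passes because there is no earlier history to look for; only later attempts are required to find it.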
                
> AM can rerun job after reporting final job status to the client
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-4819
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Bikas Saha
>            Priority: Critical
>
> If the AM reports final job status to the client but then crashes before unregistering
with the RM then the RM can run another AM attempt.  Currently AM re-attempts assume that
the previous attempts did not reach a final job state, and that causes the job to rerun (from
scratch, if the output format doesn't support recovery).
> Re-running the job when we've already told the client the final status of the job is
bad for a number of reasons.  If the job failed, it's confusing at best since the client was
already told the job failed but the subsequent attempt could succeed.  If the job succeeded
there could be data loss, as a subsequent job launched by the client tries to consume the
job's output as input just as the re-attempt starts removing output files in preparation for
the output commit.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
