Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mapreduce-issues@hadoop.apache.org
Date: Wed, 28 Nov 2012 16:24:59 +0000 (UTC)
From: "Jason Lowe (JIRA)" <jira@apache.org>
To: mapreduce-issues@hadoop.apache.org
Message-ID: <1351039503.33253.1354119899353.JavaMail.jiratomcat@arcas>
In-Reply-To: <1309689274.24015.1353956698669.JavaMail.jiratomcat@arcas>
Subject: [jira] [Commented] (MAPREDUCE-4819) AM can rerun job after
 reporting final job status to the client
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505617#comment-13505617 ] 

Jason Lowe commented on MAPREDUCE-4819:
---------------------------------------

We have to be careful here: the job history log is moved to the done intermediate directory during shutdown, after the client has been notified.  That leaves a window where the AM can fail after notifying the client and moving the job history file but before unregistering from the RM.  When the app attempt restarts in that case, it won't find the job history file and will end up re-running the job from scratch.  We need to either unregister from the RM first (and rely on the FINISHING grace period to buy us enough time to move the file), or explicitly *not* delete the file when we copy it to done intermediate and instead let the later removal of the staging directory clean it up.
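
To make the window concrete, here is a rough toy model of the shutdown ordering described above.  It is only a sketch, not the MR AM code: the class, the enum, the file name and the two in-memory "directories" are all hypothetical, and it assumes a restarted attempt looks for the history file in the staging directory.  It models only whether the next attempt can find the file, not what it then does with it.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the AM shutdown ordering; all names here are hypothetical. */
public class AmShutdownSketch {

    /** Point at which the simulated AM attempt dies. */
    public enum Crash { NONE, AFTER_NOTIFY, AFTER_PUBLISH }

    /** What a restarted attempt (and the RM) would observe afterwards. */
    public static class State {
        public boolean clientNotified;
        public boolean unregistered;
        public final Set<String> stagingDir = new HashSet<>();
        public final Set<String> doneIntermediateDir = new HashSet<>();
    }

    // Placeholder file name, not the real history file naming scheme.
    public static final String HISTORY = "job.jhist";

    /**
     * Runs the shutdown steps in the order discussed in the comment:
     * notify client, publish history to done intermediate, unregister from RM.
     * deleteStagingCopy=true models a move, false models copy-and-keep.
     */
    public static State runShutdown(boolean deleteStagingCopy, Crash crash) {
        State s = new State();
        s.stagingDir.add(HISTORY);

        s.clientNotified = true;
        if (crash == Crash.AFTER_NOTIFY) {
            return s;
        }

        s.doneIntermediateDir.add(HISTORY);
        if (deleteStagingCopy) {
            s.stagingDir.remove(HISTORY);
        }
        if (crash == Crash.AFTER_PUBLISH) {
            return s;
        }

        s.unregistered = true;
        return s;
    }

    /**
     * Assumption: once unregistered, no further attempt is launched; otherwise
     * a new attempt starts from scratch when the staging copy is missing.
     */
    public static boolean nextAttemptRerunsFromScratch(State s) {
        if (s.unregistered) {
            return false;
        }
        return !s.stagingDir.contains(HISTORY);
    }

    public static void main(String[] args) {
        State moved = runShutdown(true, Crash.AFTER_PUBLISH);
        State copied = runShutdown(false, Crash.AFTER_PUBLISH);
        System.out.println("move, crash before unregister: rerun="
                + nextAttemptRerunsFromScratch(moved));
        System.out.println("copy, crash before unregister: rerun="
                + nextAttemptRerunsFromScratch(copied));
    }
}
```

In this model, crashing between the publish step and the unregister step triggers a from-scratch rerun only when the staging copy was deleted, which is the second option above.  The first option (unregister before moving the file) would correspond to reordering the steps in runShutdown and is not modeled here.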
                
> AM can rerun job after reporting final job status to the client
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-4819
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Bikas Saha
>            Priority: Critical
>
> If the AM reports final job status to the client but then crashes before unregistering with the RM, the RM can run another AM attempt.  Currently AM re-attempts assume that the previous attempts did not reach a final job state, and that causes the job to rerun (from scratch, if the output format doesn't support recovery).
> Re-running the job when we've already told the client the final status of the job is bad for a number of reasons.  If the job failed, it's confusing at best since the client was already told the job failed but the subsequent attempt could succeed.  If the job succeeded there could be data loss, as a subsequent job launched by the client tries to consume the job's output as input just as the re-attempt starts removing output files in preparation for the output commit.
