X-Original-To: apmail-hadoop-mapreduce-issues-archive@minotaur.apache.org
Delivered-To: apmail-hadoop-mapreduce-issues-archive@minotaur.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 3BD4AD9C8 for ; Tue, 27 Nov 2012 05:48:02 +0000 (UTC)
Received: (qmail 18179 invoked by uid 500); 27 Nov 2012 05:48:01 -0000
Delivered-To: apmail-hadoop-mapreduce-issues-archive@hadoop.apache.org
Received: (qmail 17994 invoked by uid 500); 27 Nov 2012 05:48:01 -0000
Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mapreduce-issues@hadoop.apache.org
Delivered-To: mailing list mapreduce-issues@hadoop.apache.org
Received: (qmail 17955 invoked by uid 99); 27 Nov 2012 05:47:59 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 27 Nov 2012 05:47:59 +0000
Date: Tue, 27 Nov 2012 05:47:59 +0000 (UTC)
From: "Bikas Saha (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Message-ID: <1703635158.26224.1353995279938.JavaMail.jiratomcat@arcas>
In-Reply-To: <1309689274.24015.1353956698669.JavaMail.jiratomcat@arcas>
Subject: [jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504388#comment-13504388 ]

Bikas Saha commented on MAPREDUCE-4819:
---------------------------------------

If the AM talks to more than one entity about status (the client and the RM), then such races are possible. Maybe the final client notification should be the last thing done, after all post-processing is complete.
This way the client is the last to know, and will never learn of completion if things go wrong before that. This is like the NN not responding to a client until the edits have been written.

In general, it seems we need to come up with a set of markers that previous AMs leave behind, which tell the next retry whether the previous attempt failed or succeeded, and so whether the current AM should exit or continue to run.

> AM can rerun job after reporting final job status to the client
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-4819
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Bikas Saha
>            Priority: Critical
>
> If the AM reports final job status to the client but then crashes before unregistering with the RM then the RM can run another AM attempt. Currently AM re-attempts assume that the previous attempts did not reach a final job state, and that causes the job to rerun (from scratch, if the output format doesn't support recovery).
> Re-running the job when we've already told the client the final status of the job is bad for a number of reasons. If the job failed, it's confusing at best since the client was already told the job failed but the subsequent attempt could succeed. If the job succeeded there could be data loss, as a subsequent job launched by the client tries to consume the job's output as input just as the re-attempt starts removing output files in preparation for the output commit.
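
To make the marker idea from the comment concrete, here is a rough sketch, not a proposed patch. It assumes the AM can write small marker files to a per-job directory that survives an AM crash. The class name `AttemptMarkers`, the marker file names, and the use of the local filesystem via `java.nio.file` (rather than HDFS) are all hypothetical choices for illustration; none of them come from the Hadoop code base.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch: marker files one AM attempt leaves for the next. */
final class AttemptMarkers {

    /** What a new AM attempt should do after inspecting the markers. */
    enum Decision { RUN_JOB, EXIT_SUCCEEDED, EXIT_FAILED }

    // Hypothetical marker names, chosen only for this example.
    static final String SUCCEEDED = "_FINAL_SUCCEEDED";
    static final String FAILED = "_FINAL_FAILED";

    /**
     * Record the final job state. The intent is that the AM calls this
     * BEFORE it notifies the client, so the client is the last to know.
     */
    static void recordFinalState(Path markerDir, boolean succeeded) throws IOException {
        Files.createDirectories(markerDir);
        Path marker = markerDir.resolve(succeeded ? SUCCEEDED : FAILED);
        if (!Files.exists(marker)) {
            Files.createFile(marker);
        }
    }

    /**
     * Called at the start of an AM attempt: if a previous attempt already
     * reached a final state, exit with that state instead of rerunning.
     */
    static Decision decide(Path markerDir) {
        if (Files.exists(markerDir.resolve(SUCCEEDED))) {
            return Decision.EXIT_SUCCEEDED;
        }
        if (Files.exists(markerDir.resolve(FAILED))) {
            return Decision.EXIT_FAILED;
        }
        // No final marker: the previous attempt (if any) never finished.
        return Decision.RUN_JOB;
    }
}
```

With this ordering, an attempt that crashes after `recordFinalState` but before unregistering with the RM leaves enough behind for the next attempt to exit with the same status rather than rerun the job. A real implementation would also need to handle a crash in the middle of the output commit, which this sketch does not attempt.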