Return-Path:
X-Original-To: apmail-hadoop-mapreduce-issues-archive@minotaur.apache.org
Delivered-To: apmail-hadoop-mapreduce-issues-archive@minotaur.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id AD86D109ED for ; Wed, 15 Jan 2014 18:07:31 +0000 (UTC)
Received: (qmail 60204 invoked by uid 500); 15 Jan 2014 18:07:30 -0000
Delivered-To: apmail-hadoop-mapreduce-issues-archive@hadoop.apache.org
Received: (qmail 60090 invoked by uid 500); 15 Jan 2014 18:07:30 -0000
Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: mapreduce-issues@hadoop.apache.org
Delivered-To: mailing list mapreduce-issues@hadoop.apache.org
Received: (qmail 60045 invoked by uid 99); 15 Jan 2014 18:07:29 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 15 Jan 2014 18:07:29 +0000
Date: Wed, 15 Jan 2014 18:07:29 +0000 (UTC)
From: "Karthik Kambatla (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Message-ID:
In-Reply-To:
References:
Subject: [jira] [Commented] (MAPREDUCE-5718) MR AM should tolerate RM restart/failover during commit
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872340#comment-13872340 ]

Karthik Kambatla commented on MAPREDUCE-5718:
---------------------------------------------

Thanks for chiming in, Jason. Please correct me if I am wrong.

Not being able to tolerate node failures (slaves/master) seems like a major regression from MR1, which tolerates slave failures. I am wondering if there is a way to solve the crashed-commits issue for all jobs, not just for MR.
For MR, what do you think of committing to an intermediate location, and renaming it to the output location? If the output location is missing, the commit can be retried.

> MR AM should tolerate RM restart/failover during commit
> -------------------------------------------------------
>
>                 Key: MAPREDUCE-5718
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5718
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.4.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>              Labels: ha
>         Attachments: mr-5718-0.patch
>
>
> While testing RM HA, we ran into this issue where if the RM fails over while an MR AM is in the middle of a commit, the subsequent AM gets spawned but dies with a diagnostic message - "We crashed durring a commit".



--
This message was sent by Atlassian JIRA (v6.1.5#6160)
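
The intermediate-location idea in the comment above can be sketched roughly as follows. This is only an illustration on a local filesystem using the Python standard library, not the MR AM code or the Hadoop FileSystem API. The function name `commit_output`, the `write_results` callback, and the `.staging` suffix are all hypothetical, and the sketch assumes the rename happens within a single filesystem so that it is atomic.

```python
import os
import shutil
import tempfile


def commit_output(write_results, output_dir):
    """Write results to a staging directory, then rename it to output_dir.

    Returns True if this call performed the commit, and False if output_dir
    already existed, meaning an earlier attempt finished the commit.
    All names here are hypothetical; this is not Hadoop's committer.
    """
    if os.path.exists(output_dir):
        # An earlier attempt already completed the rename; nothing to redo.
        return False

    staging_dir = output_dir + ".staging"  # assumed naming convention
    # Discard leftovers from a crashed attempt so the retry starts clean.
    shutil.rmtree(staging_dir, ignore_errors=True)
    os.makedirs(staging_dir)

    # A crash anywhere in here leaves output_dir missing, so a new
    # attempt can tell that the commit must be retried.
    write_results(staging_dir)

    # Publish the results in one step. On POSIX a rename within one
    # filesystem is atomic, so output_dir is either absent or complete.
    os.rename(staging_dir, output_dir)
    return True
```

A restarted attempt would simply call `commit_output` again: if the output location is missing, the commit is redone from scratch, and if it is present, the earlier commit is treated as finished. Whether the same guarantee holds on a given distributed filesystem depends on that filesystem's rename semantics.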