Date: Thu, 1 May 2014 14:12:18 +0000 (UTC)
From: "Jason Lowe (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Reply-To: mapreduce-issues@hadoop.apache.org
Subject: [jira] [Updated] (MAPREDUCE-5652) NM Recovery. ShuffleHandler should handle NM restarts

     [ https://issues.apache.org/jira/browse/MAPREDUCE-5652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-5652:
----------------------------------

    Attachment: MAPREDUCE-5652-v9-and-YARN-1987.patch

Filed YARN-1987 to cover the DBIterator wrapper and updated the patch to use that new wrapper class. Note that the patch includes YARN-1987 so Jenkins can comment.

bq. If ShuffleHandler gets DBException during recoverState as part of serviceStart, should ShuffleHandler ignore the exception and continue like the store doesn't exist?

Failure to recover should be a rare situation: either the DB is corrupted or inaccessible, or there is a schema incompatibility between versions because an upgrade occurred during the NM downtime. It should be investigated and corrected; otherwise the errors will likely be glossed over and we will continue to fail to shuffle across NM restarts from that point forward, despite the user specifying otherwise. We could add a config to request a "best effort" mode that continues despite the inability to recover, but is that an NM-wide config, a config just for the shuffle handler, or something else? If we want a config to control this, I propose we address it in a followup JIRA.

> NM Recovery. ShuffleHandler should handle NM restarts
> -----------------------------------------------------
>
>                 Key: MAPREDUCE-5652
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5652
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>   Affects Versions: 2.2.0
>            Reporter: Karthik Kambatla
>            Assignee: Jason Lowe
>              Labels: shuffle
>         Attachments: MAPREDUCE-5652-v2.patch, MAPREDUCE-5652-v3.patch, MAPREDUCE-5652-v4.patch, MAPREDUCE-5652-v5.patch, MAPREDUCE-5652-v6.patch, MAPREDUCE-5652-v7.patch, MAPREDUCE-5652-v8.patch, MAPREDUCE-5652-v9-and-YARN-1987.patch, MAPREDUCE-5652.patch
>
>
> ShuffleHandler should work across NM restarts and not require re-running map tasks. Currently, on NM restart, the map outputs are cleaned up, which requires re-executing the map tasks; this should be avoided.
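
For illustration only, the fail-fast versus "best effort" trade-off discussed in the comment above might look roughly like the sketch below. Nothing here is taken from the attached patch: the StateStore interface, the RecoverableService class and the bestEffort flag are all hypothetical names, and IOException merely stands in for whatever exception the real state store would raise during recovery.

```java
import java.io.IOException;

// Hypothetical sketch; none of these types exist in Hadoop or in the patch.
class ShuffleRecoverySketch {

    /** Stand-in for the persistent store that survives a restart. */
    interface StateStore {
        /** Returns the number of jobs recovered, or throws if the store is unreadable. */
        int recoverJobs() throws IOException;
    }

    /** Stand-in for a service that recovers state while starting up. */
    static final class RecoverableService {
        private final StateStore store;
        private final boolean bestEffort; // assumed config knob, not a real property
        private int recoveredJobs = 0;
        private boolean recoveryFailed = false;

        RecoverableService(StateStore store, boolean bestEffort) {
            this.store = store;
            this.bestEffort = bestEffort;
        }

        void start() throws IOException {
            try {
                recoveredJobs = store.recoverJobs();
            } catch (IOException e) {
                if (!bestEffort) {
                    // Fail fast: surface the problem so it gets investigated.
                    throw new IOException("Unable to recover shuffle state", e);
                }
                // Best effort: remember the failure and start with empty state.
                recoveryFailed = true;
                recoveredJobs = 0;
                System.err.println("WARN: recovery failed, continuing without state: "
                        + e.getMessage());
            }
        }

        int getRecoveredJobs() { return recoveredJobs; }

        boolean hasRecoveryFailed() { return recoveryFailed; }
    }

    public static void main(String[] args) {
        StateStore broken = () -> { throw new IOException("store is corrupted"); };

        RecoverableService lenient = new RecoverableService(broken, true);
        try {
            lenient.start();
            System.out.println("best effort started, failed=" + lenient.hasRecoveryFailed());
        } catch (IOException e) {
            System.out.println("unexpected: " + e.getMessage());
        }

        RecoverableService strict = new RecoverableService(broken, false);
        try {
            strict.start();
            System.out.println("unexpected: strict service started");
        } catch (IOException e) {
            System.out.println("fail fast refused to start: " + e.getMessage());
        }
    }
}
```

With the default in this sketch (bestEffort set to false), a corrupted store stops startup, which matches the position argued in the comment; where such a flag would live (NM-wide or per auxiliary service) is exactly the open question left for a followup JIRA.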