Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mapreduce-issues@hadoop.apache.org
Date: Thu, 1 May 2014 14:12:18 +0000 (UTC)
From: "Jason Lowe (JIRA)" <jira@apache.org>
To: mapreduce-issues@hadoop.apache.org
Message-ID: <JIRA.12681128.1385413613402.213354.1398953538666@arcas>
In-Reply-To: <JIRA.12681128.1385413613402@arcas>
References: <JIRA.12681128.1385413613402@arcas>
Subject: [jira] [Updated] (MAPREDUCE-5652) NM Recovery. ShuffleHandler
 should handle NM restarts
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/MAPREDUCE-5652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-5652:
----------------------------------

    Attachment: MAPREDUCE-5652-v9-and-YARN-1987.patch

Filed YARN-1987 to cover the DBIterator wrapper and updated the patch to use that new wrapper class.  Note that the patch includes the YARN-1987 changes so Jenkins can comment.

bq. If ShuffleHandler gets DBException during recoverState as part of serviceStart, should ShuffleHandler ignore the exception and continue like the store doesn't exist?

Failure to recover should be rare: it means the DB is corrupted or inaccessible, or there is a schema incompatibility between versions because an upgrade occurred during the NM downtime.  It should be investigated and corrected rather than ignored; otherwise the errors will likely be glossed over and we will continue to fail to shuffle across NM restarts from that point forward, despite the user specifying otherwise.

We could add a config to request a "best effort" mode where it will continue despite the inability to recover, but is that an NM-wide config, a config just for the shuffle handler, or something else?  If we want a config to control this, I propose we address it in a follow-up JIRA.
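
To make the two behaviors concrete, here is a rough, self-contained sketch of the choice being discussed. It is not code from the patch: `StateStore`, `StoreException`, `startWithRecovery`, and the `bestEffort` flag are all hypothetical names used only for illustration, and no such config exists today.

```java
public class RecoverySketch {
    /** Hypothetical unchecked exception standing in for a DB failure during recovery. */
    static class StoreException extends RuntimeException {
        StoreException(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for the shuffle state store. */
    interface StateStore {
        /** Returns the number of recovered entries, or throws StoreException. */
        int recover();
    }

    /**
     * Sketch of the start-up decision. With bestEffort == false (the behavior
     * argued for above) a recovery failure aborts start-up so that it gets
     * noticed. With bestEffort == true (the assumed, not yet existing, config)
     * the failure is logged and start-up continues as if the store were empty.
     */
    static int startWithRecovery(StateStore store, boolean bestEffort) {
        try {
            return store.recover();
        } catch (StoreException e) {
            if (!bestEffort) {
                throw new IllegalStateException("shuffle state recovery failed", e);
            }
            System.err.println("WARN: continuing without recovered state: " + e.getMessage());
            return 0;
        }
    }

    public static void main(String[] args) {
        StateStore healthy = () -> 3;
        StateStore corrupted = () -> {
            throw new StoreException("corrupt store");
        };

        System.out.println(startWithRecovery(healthy, false));
        System.out.println(startWithRecovery(corrupted, true));
        try {
            startWithRecovery(corrupted, false);
        } catch (IllegalStateException e) {
            System.out.println("start failed: " + e.getMessage());
        }
    }
}
```

Whether such a flag would be read NM-wide or only by the shuffle handler is exactly the open question above; the sketch deliberately takes it as a plain parameter.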

> NM Recovery. ShuffleHandler should handle NM restarts
> -----------------------------------------------------
>
>                 Key: MAPREDUCE-5652
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5652
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.2.0
>            Reporter: Karthik Kambatla
>            Assignee: Jason Lowe
>              Labels: shuffle
>         Attachments: MAPREDUCE-5652-v2.patch, MAPREDUCE-5652-v3.patch, MAPREDUCE-5652-v4.patch, MAPREDUCE-5652-v5.patch, MAPREDUCE-5652-v6.patch, MAPREDUCE-5652-v7.patch, MAPREDUCE-5652-v8.patch, MAPREDUCE-5652-v9-and-YARN-1987.patch, MAPREDUCE-5652.patch
>
>
> ShuffleHandler should work across NM restarts and not require re-running map tasks. Currently, on NM restart the map outputs are cleaned up, which forces re-execution of map tasks; this should be avoided.

