Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Date: Fri, 13 Oct 2017 13:16:00 +0000 (UTC)
From: "Jason Lowe (JIRA)" <jira@apache.org>
To: mapreduce-issues@hadoop.apache.org
Message-ID: <JIRA.13109216.1507888750000.3719.1507900560727@Atlassian.JIRA>
In-Reply-To: <JIRA.13109216.1507888750000@Atlassian.JIRA>
References: <JIRA.13109216.1507888750000@Atlassian.JIRA> <JIRA.13109216.1507888750613@jira-lw-us.apache.org>
Subject: [jira] [Updated] (MAPREDUCE-6982) Containers on lost nodes are
 considered failed after a too long time.
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Fri, 13 Oct 2017 13:16:05 -0000


     [ https://issues.apache.org/jira/browse/MAPREDUCE-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-6982:
----------------------------------
    Resolution: Duplicate
        Status: Resolved  (was: Patch Available)

> Containers on lost nodes are considered failed after a too long time.
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6982
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6982
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.6.0
>         Environment: cdh5.5.0
>            Reporter: Nicolas Fraison
>            Priority: Minor
>         Attachments: MAPREDUCE-6982.patch
>
>
> Containers on lost nodes (NodeManager unavailable or server unavailable) take too long to be considered failed.
> This is because the AppMaster tries to clean up the container on the unavailable node.
> The proposed patch limits the impact of this timeout by handling NodeManager lost events in the AM as described below:
> *     On NodeManager service unavailability (crash, OOM, ...):
>     When receiving a lost NodeManager event, the AM fails the impacted attempt and does not go through the cleanup stage.
> *     On NodeManager server unavailability, with default settings the AM first detects that the attempt has timed out and tries to clean up the attempt:
>     When receiving a lost NodeManager event, the AM stops the cleanup process on the impacted container and fails the attempt.
> This reduces the duration of the timeout to the time needed to detect that a NodeManager is down.
> This issue is similar to [MAPREDUCE-6659|https://issues.apache.org/jira/browse/MAPREDUCE-6659], to which I could not attach the patch.
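
The two cases in the description can be sketched as a small state machine. This is only an illustration of the described behaviour, not the attached patch and not actual MR AppMaster code: the class LostNodeSketch, the AttemptState values, and the methods onAttemptTimeout and onNodeLost are all hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names come from Hadoop or from the patch.
class LostNodeSketch {
    enum AttemptState { RUNNING, CLEANING_UP, FAILED }

    static final class Attempt {
        final String id;
        final String nodeId;
        AttemptState state = AttemptState.RUNNING;
        boolean cleanupCancelled = false;

        Attempt(String id, String nodeId) {
            this.id = id;
            this.nodeId = nodeId;
        }
    }

    // Server-unavailable case: the AM first sees the attempt time out
    // and starts cleaning up its container on the (unreachable) node.
    static void onAttemptTimeout(Attempt attempt) {
        if (attempt.state == AttemptState.RUNNING) {
            attempt.state = AttemptState.CLEANING_UP;
        }
    }

    // Lost-node event: fail every unfinished attempt on that node.
    // A running attempt skips cleanup; an attempt already in cleanup
    // has its cleanup cancelled before being failed.
    static List<Attempt> onNodeLost(String lostNodeId, List<Attempt> attempts) {
        List<Attempt> failed = new ArrayList<>();
        for (Attempt attempt : attempts) {
            if (!attempt.nodeId.equals(lostNodeId)
                    || attempt.state == AttemptState.FAILED) {
                continue;
            }
            if (attempt.state == AttemptState.CLEANING_UP) {
                attempt.cleanupCancelled = true;
            }
            attempt.state = AttemptState.FAILED;
            failed.add(attempt);
        }
        return failed;
    }

    public static void main(String[] args) {
        Attempt running = new Attempt("attempt_1", "node-1");
        Attempt timedOut = new Attempt("attempt_2", "node-1");
        Attempt elsewhere = new Attempt("attempt_3", "node-2");
        onAttemptTimeout(timedOut);

        List<Attempt> all = List.of(running, timedOut, elsewhere);
        for (Attempt a : onNodeLost("node-1", all)) {
            System.out.println(a.id + " failed, cleanup cancelled: " + a.cleanupCancelled);
        }
        System.out.println(elsewhere.id + " state: " + elsewhere.state);
    }
}
```

In this sketch an attempt on another node is left untouched, and an attempt that is already failed is not reported twice.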

