hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-6982) Containers on lost nodes are considered failed after a too long time.
Date Fri, 13 Oct 2017 13:16:00 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-6982:
----------------------------------
    Resolution: Duplicate
        Status: Resolved  (was: Patch Available)

> Containers on lost nodes are considered failed after a too long time.
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6982
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6982
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.6.0
>         Environment: cdh5.5.0
>            Reporter: Nicolas Fraison
>            Priority: Minor
>         Attachments: MAPREDUCE-6982.patch
>
>
> Containers on lost nodes (NodeManager service unavailable or server unavailable) take
too long to be considered failed.
> This is because the AppMaster tries to clean up the container on the unavailable node.
> The proposed patch limits the impact of this timeout by handling NodeManager lost
events in the AM as described below:
> *     On NodeManager service unavailability (crash, OOM, ...):
>     When receiving a lost NodeManager event, the AM fails the impacted attempt and does not
go through the cleanup stage.
> *     On NodeManager server unavailability, with default settings the AM first detects that
the attempt has timed out and tries to clean up the attempt:
>     When receiving a lost NodeManager event, the AM stops the cleanup process on the impacted
container and fails the attempt.
> This reduces the duration of the timeout to the time needed to detect that a NodeManager is down.
> This issue is similar to [MAPREDUCE-6659|https://issues.apache.org/jira/browse/MAPREDUCE-6659],
to which I can't attach the patch.
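
The two cases described above can be illustrated with a small state machine. This is only a minimal sketch under stated assumptions, not the attached patch and not the real MapReduce AM classes: `LostNodeSketch`, `Attempt`, and the `State` and `Event` names are all hypothetical and exist only to show the intended transitions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of a task attempt; not the real MR AM code. */
class LostNodeSketch {
    enum State { RUNNING, CLEANUP, FAILED }
    enum Event { ATTEMPT_TIMEOUT, NODE_LOST, CLEANUP_DONE }

    static final class Attempt {
        State state = State.RUNNING;
        boolean cleanupPending = false;
        final List<String> log = new ArrayList<>();

        void handle(Event e) {
            switch (state) {
                case RUNNING:
                    if (e == Event.NODE_LOST) {
                        // Case 1: the node is reported lost while the attempt runs.
                        // Fail the attempt directly and skip the cleanup stage.
                        fail("node lost while running, cleanup skipped");
                    } else if (e == Event.ATTEMPT_TIMEOUT) {
                        // Case 2, first step: the AM sees the attempt time out
                        // and starts cleaning up the container.
                        state = State.CLEANUP;
                        cleanupPending = true;
                        log.add("cleanup started");
                    }
                    break;
                case CLEANUP:
                    if (e == Event.NODE_LOST) {
                        // Case 2, second step: the node is reported lost during
                        // cleanup. Stop the cleanup and fail the attempt.
                        cleanupPending = false;
                        fail("node lost during cleanup, cleanup aborted");
                    } else if (e == Event.CLEANUP_DONE) {
                        cleanupPending = false;
                        fail("cleanup finished");
                    }
                    break;
                case FAILED:
                    // Terminal state in this sketch: further events are ignored.
                    break;
            }
        }

        private void fail(String reason) {
            state = State.FAILED;
            log.add("failed: " + reason);
        }
    }

    public static void main(String[] args) {
        Attempt serviceDown = new Attempt();
        serviceDown.handle(Event.NODE_LOST);
        System.out.println(serviceDown.state + " " + serviceDown.log);

        Attempt serverDown = new Attempt();
        serverDown.handle(Event.ATTEMPT_TIMEOUT);
        serverDown.handle(Event.NODE_LOST);
        System.out.println(serverDown.state + " " + serverDown.log);
    }
}
```

In both paths the attempt reaches `FAILED` as soon as the lost-node event arrives, so in this model the wait is bounded by how quickly a NodeManager is detected as down rather than by the cleanup timeout.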



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-help@hadoop.apache.org

