hadoop-mapreduce-issues mailing list archives

From "Nicolas Fraison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6659) Mapreduce App master waits long to kill containers on lost nodes.
Date Wed, 08 Nov 2017 09:59:01 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243637#comment-16243637 ]

Nicolas Fraison commented on MAPREDUCE-6659:

[~jlowe], MAPREDUCE-5465 is already applied to the Hadoop release I use (cdh5.5.0).
I've tested the behaviour when a NodeManager is lost on both cdh5.5 and trunk, and it is the same.

The RM sends a LostNM event to the AM, which tries to clean up the containers running on that
node (on both cdh5.5 and trunk). The attempt is marked as failed only after the connection to
the lost NM times out.
The main difference between cdh5.5 and trunk is that the timeout is much shorter in trunk
(3 min instead of at least 30 min).
This is thanks to the patches https://issues.apache.org/jira/browse/YARN-4414 and https://issues.apache.org/jira/browse/YARN-3554.
Backporting those patches could be considered sufficient. What do you think?

> Mapreduce App master waits long to kill containers on lost nodes.
> -----------------------------------------------------------------
>                 Key: MAPREDUCE-6659
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6659
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.6.0
>            Reporter: Laxman
>            Assignee: Nicolas Fraison
> The MR application master waits a very long time to clean up and relaunch the tasks on lost
> nodes. The wait time is actually 2.5 hours (ipc.client.connect.max.retries * ipc.client.connect.max.retries.on.timeouts
> * ipc.client.connect.timeout = 10 * 45 * 20 = 9000 seconds = 2.5 hours).
> A similar issue in the RM-AM RPC protocol was fixed in YARN-3809.
> As in YARN-3809, we may need to introduce new configurations to control this RPC
> retry behavior.
> Also, I feel this total retry time should honor, and be capped at, the global task
> timeout (mapreduce.task.timeout = 600000 ms by default).
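
The arithmetic in the description above can be checked with a small sketch. The function and
variable names below are hypothetical and exist only for illustration; the three values are
taken from the issue text in the order it lists the properties, and the cap is the reporter's
suggestion, not existing Hadoop behaviour.

```python
# Worst-case wait before the AM gives up on a lost NM, as computed in the
# issue description: retries * retries-on-timeouts * connect timeout.
# All names here are hypothetical; the values come from the issue text.

def worst_case_wait_seconds(max_retries, max_retries_on_timeouts, connect_timeout_s):
    """Multiply the three IPC client settings as the description does."""
    return max_retries * max_retries_on_timeouts * connect_timeout_s


# Values quoted in the description:
#   ipc.client.connect.max.retries             = 10
#   ipc.client.connect.max.retries.on.timeouts = 45
#   ipc.client.connect.timeout                 = 20 (seconds)
total_s = worst_case_wait_seconds(10, 45, 20)
total_hours = total_s / 3600

# mapreduce.task.timeout is quoted as 600000 (milliseconds).
task_timeout_s = 600000 // 1000

# The reporter's proposal: never wait longer than the task timeout.
capped_s = min(total_s, task_timeout_s)

print(f"uncapped wait: {total_s} s ({total_hours} h)")
print(f"capped at task timeout: {capped_s} s")
```

With these values the uncapped wait is 15 times longer than the task timeout, which is why
the description suggests capping it.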
