hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (MAPREDUCE-6659) Mapreduce App master waits long to kill containers on lost nodes.
Date Fri, 13 Oct 2017 13:20:00 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe reassigned MAPREDUCE-6659:
-------------------------------------

    Assignee: Nicolas Fraison  (was: Jun Gong)

This has been sitting idle, so I'm assigning this to Nicolas to help move this forward.

I'm wondering if this is an issue in 2.8 or later.  I would expect the RM to send completed
container events along with the lost node event.  After MAPREDUCE-5465 the AM does not attempt
to kill containers that have been marked by the RM as already completed.  Seems like porting
that portion of the state machine change from MAPREDUCE-5465 would be another viable alternative.
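
A minimal sketch of the idea behind that alternative, assuming the RM has already reported some containers on the lost node as completed: the AM would only need to issue kills for the remainder. The class `LostNodeCleanup`, the method `containersToKill`, and the string container IDs are hypothetical illustrations, not the actual state machine code from MAPREDUCE-5465.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical illustration; not part of the MapReduce AM code base. */
final class LostNodeCleanup {

    /**
     * Returns the containers on a lost node that still need a kill request,
     * skipping any the RM has already reported as completed.
     */
    static Set<String> containersToKill(Set<String> onLostNode, Set<String> completedByRm) {
        Set<String> toKill = new LinkedHashSet<>(onLostNode);
        toKill.removeAll(completedByRm);
        return toKill;
    }

    public static void main(String[] args) {
        // Made-up container IDs, for illustration only.
        Set<String> onLostNode = new LinkedHashSet<>(Arrays.asList("c1", "c2", "c3"));
        Set<String> completedByRm = new LinkedHashSet<>(Arrays.asList("c1", "c3"));
        System.out.println(containersToKill(onLostNode, completedByRm));
    }
}
```

If the RM reports every container on the lost node as completed, the result is empty and the AM would have no kill RPCs left to retry against the unreachable node.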


> Mapreduce App master waits long to kill containers on lost nodes.
> -----------------------------------------------------------------
>
>                 Key: MAPREDUCE-6659
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6659
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.6.0
>            Reporter: Laxman
>            Assignee: Nicolas Fraison
>
> The MR Application Master waits a very long time to clean up and relaunch tasks on lost
> nodes. The wait is actually 2.5 hours (ipc.client.connect.max.retries * ipc.client.connect.max.retries.on.timeouts
> * ipc.client.connect.timeout = 10 * 45 * 20 s = 9000 seconds = 2.5 hours).
> A similar issue in the RM-AM RPC protocol was fixed in YARN-3809.
> As in YARN-3809, we may need to introduce new configurations to control this RPC
> retry behavior.
> Also, I feel this total retry time should honor, and be capped at, the global task
> timeout (mapreduce.task.timeout, 600000 ms by default).
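
The retry arithmetic in the quoted description can be checked with a short sketch. It multiplies the three values given in the report (10 retries, 45 retries on timeouts, a 20-second connect timeout) and then applies the cap the reporter suggests, using the 600000 ms task timeout. The class `RetryBudget` and its methods are hypothetical helpers for illustration only; they are not Hadoop APIs and do not read any Hadoop configuration.

```java
import java.time.Duration;

/** Hypothetical helper for illustration; not part of Hadoop. */
final class RetryBudget {

    /** Worst-case wait: retries * retries-on-timeout * per-attempt connect timeout. */
    static Duration worstCase(int maxRetries, int maxRetriesOnTimeouts, long connectTimeoutMs) {
        long attempts = Math.multiplyExact((long) maxRetries, (long) maxRetriesOnTimeouts);
        return Duration.ofMillis(Math.multiplyExact(attempts, connectTimeoutMs));
    }

    /** Caps the retry budget at the task timeout, as the report proposes. */
    static Duration capped(Duration worstCase, long taskTimeoutMs) {
        Duration cap = Duration.ofMillis(taskTimeoutMs);
        return worstCase.compareTo(cap) > 0 ? cap : worstCase;
    }

    public static void main(String[] args) {
        // Values taken from the issue description: 10 * 45 * 20 s.
        Duration worst = worstCase(10, 45, 20_000L);
        System.out.println("uncapped: " + worst.getSeconds() + " s");

        // mapreduce.task.timeout value quoted in the report: 600000 ms.
        Duration limited = capped(worst, 600_000L);
        System.out.println("capped:   " + limited.getSeconds() + " s");
    }
}
```

With the reported values the uncapped budget is 9000 seconds (2.5 hours), while the capped budget is 600 seconds, which is the gap the reporter is pointing at.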



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-help@hadoop.apache.org

