hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
Date Mon, 04 May 2015 15:59:07 GMT

    [ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526775#comment-14526775 ]

Jason Lowe commented on YARN-3554:
----------------------------------

YARN-3518 is a separate concern with different ramifications.  We should discuss it there
and not mix these two.

bq. set this to a bigger value maybe based on network partition considerations not only for
nm restart.
What value do you propose?  As pointed out earlier, anything over 10 minutes is pointless
since the container allocation expires in that time.  Is it common for network partitions
to take longer than 3 minutes but less than 10 minutes?  If so we should tune the value for
that.  If not then making the value larger just slows recovery time.
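
To make the arithmetic behind that question concrete, here is a small illustrative sketch.  It is
not YARN code: the function names are hypothetical, and the 10-minute expiry is simply the figure
quoted in this discussion.  It only shows that any connect wait beyond the container allocation
expiry cannot help, because the allocation has already expired by then.

```python
# Illustrative only; not part of YARN. The constant is the 10-minute
# container allocation expiry mentioned in the discussion.
CONTAINER_ALLOCATION_EXPIRY_MS = 10 * 60 * 1000


def useful_connect_wait_ms(max_wait_ms, expiry_ms=CONTAINER_ALLOCATION_EXPIRY_MS):
    """Portion of the connect wait during which a successful connect still helps."""
    return min(max_wait_ms, expiry_ms)


def wasted_connect_wait_ms(max_wait_ms, expiry_ms=CONTAINER_ALLOCATION_EXPIRY_MS):
    """Portion of the connect wait spent after the allocation has already expired."""
    return max(0, max_wait_ms - expiry_ms)


if __name__ == "__main__":
    # Current default (15 minutes) versus the proposed 3 minutes.
    for wait_ms in (900000, 180000):
        print(wait_ms, useful_connect_wait_ms(wait_ms), wasted_connect_wait_ms(wait_ms))
```

With the current 15-minute default, the last 5 minutes of waiting fall after the allocation has
expired; with 3 minutes, none of the wait is wasted.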

bq. 3 mins seems dangerous, If rm fails over and the recover takes several mins, nm maybe kill
all containers, in production env, it's not expected.

This JIRA is configuring the amount of time NM clients (primarily ApplicationMasters, and
the RM when launching ApplicationMasters) will try to connect to a particular NM before
failing.  I'm missing how RM failover leads to a mass killing of containers due to this proposed
change.  This is not a property used by the NM, so the NM is not going to start killing all
containers differently based on an updated value for it.  The only case where the RM will
use this property is when connecting to NMs to launch AM containers, and it will only do so
for NMs that have recently heartbeated.  Could you explain how this leads to all containers
getting killed on a particular node?
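
For illustration, the client-side behavior described above (keep trying one NM until a maximum
wait has elapsed, then fail) can be sketched roughly as below.  This is a simplified stand-in and
not YARN's actual retry implementation; the function name, the retry interval parameter, and the
injectable clock and sleep hooks are all assumptions made for the example.  Note that nothing here
runs on the NM or touches running containers; it only decides when the caller gives up.

```python
import time


def connect_with_max_wait(connect, max_wait_ms, retry_interval_ms,
                          clock=time.monotonic, sleep=time.sleep):
    """Call connect() until it succeeds or max_wait_ms has elapsed.

    `connect` is any callable that raises ConnectionError on failure.
    `clock` and `sleep` are injectable so the loop can be exercised
    without real waiting.
    """
    deadline = clock() + max_wait_ms / 1000.0
    while True:
        try:
            return connect()
        except ConnectionError:
            # Give up once the maximum wait is used up; the caller sees the failure.
            if clock() >= deadline:
                raise
            remaining = deadline - clock()
            # Never sleep past the deadline.
            sleep(min(retry_interval_ms / 1000.0, max(0.0, remaining)))
```

A caller would pass its own connect function and the configured wait, for example
`connect_with_max_wait(open_nm_proxy, 180000, 10000)`, where `open_nm_proxy` is a hypothetical
callable supplied by the client.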

> Default value for maximum nodemanager connect wait time is too high
> -------------------------------------------------------------------
>
>                 Key: YARN-3554
>                 URL: https://issues.apache.org/jira/browse/YARN-3554
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Assignee: Naganarasimha G R
>              Labels: newbie
>         Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch
>
>
> The default value for yarn.client.nodemanager-connect.max-wait-ms is 900000 msec or 15
> minutes, which is way too high.  The default container expiry time from the RM and the default
> task timeout in MapReduce are both only 10 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
