hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
Date Thu, 18 Jun 2015 13:49:01 GMT

    [ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14591806#comment-14591806 ]

Jason Lowe commented on YARN-3809:

I agree the thread pool should be configurable and possibly larger by default, but note that
the thread pool size has little to do with the number of AMs running on a node.  The pool
size needs to be as large as the worst-case number of AMs being launched during the worst-case
retry duration to avoid any unnecessary delays.  For a 15-minute retry delay on a cluster
launching multiple apps per second on average, that's an unreasonable thread pool size.  I
agree with [~kasha] that we need to lower the retry timeout as part of this fix.  As it is
today, we will expire the NM due to lack of heartbeat before we give up on an AM launch
retry, which makes no sense.
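A back-of-the-envelope sketch of that sizing math (the launch rate here is hypothetical, just to make the point concrete):

```java
// Rough launcher pool sizing: every AM launched during the worst-case
// retry window can occupy a launcher thread at the same time.
public class PoolSizing {
    static long threadsNeeded(double launchesPerSecond, int retryWindowSeconds) {
        return (long) (launchesPerSecond * retryWindowSeconds);
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 2 launches/second, 15-minute (900 s) retry delay.
        System.out.println(threadsNeeded(2.0, 15 * 60) + " threads"); // prints "1800 threads"
    }
}
```

1800 threads to avoid delays, against a default pool of 10, is why simply growing the pool cannot be the whole fix.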

We can update ipc.client.connect.max.retries.on.timeouts and potentially ipc.client.connect.timeout
for the conf passed to the NM proxy we create, although we need to make sure we make a copy
of the config to avoid polluting other proxies.
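A minimal sketch of that idea; the `baseConf` variable and the specific override values are illustrative, not from any patch:

```java
// Sketch only: copy the shared Configuration before tightening IPC retry
// settings for the NM proxy, so other proxies keep the shared defaults.
Configuration launcherConf = new Configuration(baseConf); // copy constructor; baseConf stays untouched
launcherConf.setInt("ipc.client.connect.max.retries.on.timeouts", 3); // illustrative value
launcherConf.setInt("ipc.client.connect.timeout", 20000);             // milliseconds, illustrative
// ... create the ContainerManagementProtocol proxy using launcherConf ...
```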

> Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
> ----------------------------------------------------------------------------------
>                 Key: YARN-3809
>                 URL: https://issues.apache.org/jira/browse/YARN-3809
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3809.01.patch
> ApplicationMasterLauncher creates a thread pool whose size is 10 to handle AMLauncherEventType
> (LAUNCH and CLEANUP). In our cluster, there were many NMs with 10+ AMs running on them, and
> one shut down for some reason. After the RM found the NM LOST, it cleaned up the AMs running
> on it, so ApplicationMasterLauncher had to handle these 10+ CLEANUP events. Its thread pool
> filled up, and all its threads hung in containerMgrProxy.stopContainers(stopRequest) because
> the NM was down and the default RPC timeout is 15 minutes. That meant that for 15 minutes
> ApplicationMasterLauncher could not handle new events such as LAUNCH, so new attempts failed
> to launch because of the timeout.
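The starvation described above can be reproduced in miniature with a plain fixed-size executor. This sketch is illustrative only: a pool of 2 stands in for the launcher's pool of 10, and a latch stands in for the stopContainers() RPC that blocks until the timeout fires:

```java
import java.util.concurrent.*;

public class LauncherStarvation {
    // Returns true if a newly submitted task actually starts within the deadline.
    static boolean newTaskStarts(ExecutorService pool, long deadlineMillis) throws Exception {
        CountDownLatch started = new CountDownLatch(1);
        pool.submit(started::countDown); // stands in for a LAUNCH event
        return started.await(deadlineMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2); // stand-in for the pool of 10
        CountDownLatch nmDown = new CountDownLatch(1);          // stand-in for the blocked RPC
        for (int i = 0; i < 2; i++) {
            // Each CLEANUP occupies a thread, blocked until the "RPC timeout" releases it.
            pool.submit(() -> { try { nmDown.await(); } catch (InterruptedException ignored) {} });
        }
        System.out.println(newTaskStarts(pool, 200)); // prints "false": LAUNCH is stuck behind CLEANUPs
        nmDown.countDown(); // the "RPC timeout" finally fires; the pool drains
        pool.shutdown();
    }
}
```

Once every worker is parked in the blocking call, new events just queue; nothing about the pool itself is broken, which is why shrinking the retry window matters as much as growing the pool.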

This message was sent by Atlassian JIRA
