hadoop-yarn-issues mailing list archives

From "Jun Gong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
Date Thu, 18 Jun 2015 01:35:01 GMT

    [ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14591093#comment-14591093 ]

Jun Gong commented on YARN-3809:
--------------------------------

[~devaraj.k] and [~kasha], thank you for the comments and suggestions.

{quote}
Shouldn't the number of threads in the pool be at least as big as the maximum number of apps
that could run on a node? By making it configurable, how do we expect the admins to pick this
number? Just pick an arbitrarily high value?
{quote}
The threads in the pool only launch and stop AMs, so it would be better for the pool to be
at least as large as the maximum number of AMs that could run on a node. Although we cannot
know that maximum for every cluster in advance, a larger value lets the launcher process
AMLauncher events faster. Admins can start with the default value and increase it if they
find it too small.
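
As a rough illustration of "default value, adjustable by admins", here is a minimal sketch in plain Java. It does not use Hadoop's Configuration class, and the property name `example.amlauncher.thread-count` and the default of 50 are hypothetical, chosen only for this example; they are not taken from the patch.

```java
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class LauncherPoolConfig {
    // Hypothetical property name and default, for illustration only.
    static final String THREAD_COUNT_KEY = "example.amlauncher.thread-count";
    static final int DEFAULT_THREAD_COUNT = 50;

    /** Returns the configured pool size, or the default when the key is unset. */
    static int threadCount(Properties conf) {
        String raw = conf.getProperty(THREAD_COUNT_KEY);
        if (raw == null) {
            return DEFAULT_THREAD_COUNT;
        }
        int n = Integer.parseInt(raw.trim());
        if (n < 1) {
            throw new IllegalArgumentException(THREAD_COUNT_KEY + " must be >= 1, got " + n);
        }
        return n;
    }

    /** Builds a fixed-size pool sized from the configuration. */
    static ExecutorService newLauncherPool(Properties conf) {
        return Executors.newFixedThreadPool(threadCount(conf));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println("default size: " + threadCount(conf));

        // An admin who sees events queueing up raises the value.
        conf.setProperty(THREAD_COUNT_KEY, "100");
        System.out.println("tuned size: " + threadCount(conf));

        ExecutorService pool = newLauncherPool(conf);
        pool.shutdown();
    }
}
```

With a fixed-size pool, idle threads cost little, so a generous default mainly bounds how many stuck calls can be absorbed before new events start to queue.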

{quote}
Or, could we make it so we don't wait as long as 15 minutes?
{quote}
Yes, we could make it shorter. I think we also need a larger thread pool, so that it can
handle more events at the same time.
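
The timeout discussed here is an RPC setting, and this sketch does not show how to change it. It only illustrates the general idea of bounding a blocking call so that a worker is not held for the full 15 minutes. The class and method names are hypothetical, and the sleeping task merely stands in for a call to an unreachable NM.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BoundedStopCall {
    /**
     * Runs blockingCall on a helper thread and waits at most timeoutMillis.
     * Returns true if the call finished in time, false if it was abandoned.
     */
    static boolean completesWithin(Runnable blockingCall, long timeoutMillis)
            throws Exception {
        ExecutorService helper = Executors.newSingleThreadExecutor();
        Future<?> result = helper.submit(blockingCall);
        try {
            result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            result.cancel(true); // interrupt the stuck call
            return false;
        } finally {
            helper.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a stop request sent to a node that never answers.
        Runnable stuck = () -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        System.out.println("stuck call finished: " + completesWithin(stuck, 100));
        System.out.println("fast call finished: " + completesWithin(() -> { }, 100));
    }
}
```

Creating a helper executor per call is wasteful and is done here only to keep the example short. A shorter timeout limits how long each worker is tied up, while a larger pool limits how many stuck calls it takes to block everything, so the two changes complement each other.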

> Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
> ----------------------------------------------------------------------------------
>
>                 Key: YARN-3809
>                 URL: https://issues.apache.org/jira/browse/YARN-3809
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3809.01.patch
>
>
> ApplicationMasterLauncher creates a thread pool of size 10 to handle AMLauncherEventType
> events (LAUNCH and CLEANUP).
> In our cluster, there were many NMs with 10+ AMs running on each, and one of them shut down
> for some reason. After the RM found the NM LOST, it cleaned up the AMs running on it, so
> ApplicationMasterLauncher needed to handle these 10+ CLEANUP events. ApplicationMasterLauncher's
> thread pool filled up, and all of its threads hung in containerMgrProxy.stopContainers(stopRequest)
> because the NM was down and the default RPC timeout is 15 minutes. This means that for 15
> minutes ApplicationMasterLauncher could not handle new events such as LAUNCH, so new attempts
> failed to launch because of timeouts.
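
The failure mode described above can be reproduced in miniature with a plain Java fixed thread pool. This is a sketch, not the ResourceManager code: `PoolStarvationDemo` and `launchRuns` are hypothetical names, a latch that is never released stands in for stopContainers calls against a dead NM, and the wait is milliseconds rather than 15 minutes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class PoolStarvationDemo {
    /**
     * Submits stuckCleanups tasks that block (simulated CLEANUP events for a
     * dead node), then one LAUNCH task. Returns true if the LAUNCH task
     * completes within waitMillis.
     */
    static boolean launchRuns(int poolSize, int stuckCleanups, long waitMillis)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch nodeResponds = new CountDownLatch(1); // released only on exit
        try {
            for (int i = 0; i < stuckCleanups; i++) {
                pool.submit(() -> {
                    try {
                        nodeResponds.await(); // hangs like a call to a dead node
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            Future<String> launch = pool.submit(() -> "launched");
            try {
                launch.get(waitMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false; // LAUNCH is still queued behind the stuck CLEANUPs
            }
        } finally {
            nodeResponds.countDown();
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("pool=10, stuck=10: " + launchRuns(10, 10, 200));
        System.out.println("pool=50, stuck=10: " + launchRuns(50, 10, 200));
    }
}
```

With 10 threads and 10 stuck CLEANUP tasks, the LAUNCH task stays in the queue until the wait expires. With 50 threads it gets a free worker and completes.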



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
