hadoop-yarn-issues mailing list archives

From "Laxman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
Date Thu, 24 Mar 2016 08:07:25 GMT

    [ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209940#comment-15209940 ]

Laxman commented on YARN-3809:

We are using Hadoop 2.6.0 and are facing a similar issue with the AM-NM RPC protocol as
well. After the AM is notified of a lost NM, I see AM threads hung in containerMgrProxy.stopContainers(stopRequest).
The current patch fixes only the RM-AM protocol. Should we have a similar fix for AM-NM as well?

> Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
> ----------------------------------------------------------------------------------
>                 Key: YARN-3809
>                 URL: https://issues.apache.org/jira/browse/YARN-3809
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>             Fix For: 2.7.1
>         Attachments: YARN-3809.01.patch, YARN-3809.02.patch, YARN-3809.03.patch
> ApplicationMasterLauncher creates a thread pool of size 10 to handle AMLauncherEventType(LAUNCH
> In our cluster, there were many NMs with 10+ AMs running on each, and one of them shut down for some
reason. After the RM found the NM LOST, it cleaned up the AMs running on it. ApplicationMasterLauncher
then needed to handle these 10+ CLEANUP events. Its thread pool filled up, and all of the threads
hung in containerMgrProxy.stopContainers(stopRequest) because the NM was down; the default RPC
timeout is 15 minutes. This means that for 15 minutes ApplicationMasterLauncher could not handle
new events such as LAUNCH, so new attempts failed to launch because of the timeout.
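
The starvation described above can be reproduced in isolation with a plain fixed-size thread pool. The following is a minimal sketch, not the ApplicationMasterLauncher code: the class and method names are hypothetical, and Thread.sleep stands in for a stopContainers call blocked on a dead NM. It only illustrates that once every pool thread is stuck in a blocking call, a later LAUNCH task sits in the queue and does not start.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-alone sketch; none of these names come from Hadoop.
final class LauncherStarvationSketch {

    /**
     * Submits cleanupCount blocking "CLEANUP" tasks followed by one "LAUNCH"
     * task to a fixed pool of poolSize threads. Returns true if the LAUNCH
     * task started running within waitMillis.
     */
    public static boolean launchStartsInTime(int poolSize, int cleanupCount,
                                             long blockMillis, long waitMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch launchStarted = new CountDownLatch(1);
        try {
            for (int i = 0; i < cleanupCount; i++) {
                pool.execute(() -> {
                    try {
                        // Assumption: sleep models an RPC blocked on an unreachable NM.
                        Thread.sleep(blockMillis);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // The LAUNCH event arrives after the CLEANUP events.
            pool.execute(launchStarted::countDown);
            return launchStarted.await(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            // Interrupt the sleeping tasks and drop anything still queued.
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // All 10 threads are blocked, so LAUNCH stays queued.
        System.out.println("10 stuck CLEANUPs: "
                + launchStartsInTime(10, 10, 3000, 500));
        // One thread is still free, so LAUNCH runs right away.
        System.out.println(" 9 stuck CLEANUPs: "
                + launchStartsInTime(10, 9, 3000, 500));
    }
}
```

With 10 blocked tasks in a pool of 10, the first call should report false; with 9, one thread remains free and the second call should report true. In this sketch, a larger pool or a shorter blocking time both let LAUNCH through sooner.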
