Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: yarn-issues@hadoop.apache.org
Date: Wed, 23 Jul 2014 13:58:40 +0000 (UTC)
From: "Jason Lowe (JIRA)" <jira@apache.org>
To: yarn-issues@hadoop.apache.org
Message-ID: <JIRA.12728006.1405620755351.28765.1406123920945@arcas>
In-Reply-To: <JIRA.12728006.1405620755351@arcas>
References: <JIRA.12728006.1405620755351@arcas>
Subject: [jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can
 create thousands of threads for a large cluster
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071726#comment-14071726 ] 

Jason Lowe commented on YARN-2314:
----------------------------------

I suppose we could use a wait timeout.  I was just matching the existing behavior for refreshing the NM token on an in-use proxy, which also waits indefinitely.  What's the proposed behavior when the timeout expires?  Log a message and then...?  Arguably the timeouts belong on the RPC calls rather than on the proxy cache: if we're not willing to wait forever for a proxy to be freed up, I assume we're also not willing to wait forever for a remote call to complete.
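
To make the timeout question concrete, here is a minimal sketch of a bounded cache whose acquire call waits at most a fixed time for an idle slot.  It is not the ContainerManagementProtocolProxy code or the attached patch; the class name BoundedProxyCache, its methods, and the choice to return false on timeout are all hypothetical, and the map only tracks in-use counts rather than real proxies.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a size-bounded proxy cache with a timed wait. */
public class BoundedProxyCache {
    private final int maxSize;
    // Key: node address. Value: number of callers currently using that proxy.
    private final Map<String, Integer> inUse = new LinkedHashMap<>();

    public BoundedProxyCache(int maxSize) {
        if (maxSize < 1) {
            throw new IllegalArgumentException("maxSize must be >= 1");
        }
        this.maxSize = maxSize;
    }

    /**
     * Reserves a slot for the node. If the cache is full and every entry is
     * in use, waits up to timeoutMillis for one to be released.
     * Returns false on timeout or interrupt (an assumed policy).
     */
    public synchronized boolean acquire(String node, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        // Wait only while the node is absent, the cache is full, and no idle entry can be evicted.
        while (!inUse.containsKey(node) && inUse.size() >= maxSize && !evictIdle()) {
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false; // timed out; the caller decides what to do next
            }
            try {
                TimeUnit.NANOSECONDS.timedWait(this, remainingNanos);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        inUse.merge(node, 1, Integer::sum);
        return true;
    }

    /** Marks one use of the node's proxy as finished and wakes any waiters. */
    public synchronized void release(String node) {
        Integer count = inUse.get(node);
        if (count == null || count == 0) {
            throw new IllegalStateException("proxy not in use: " + node);
        }
        inUse.put(node, count - 1);
        notifyAll();
    }

    /** Current number of cached entries, idle or in use. */
    public synchronized int size() {
        return inUse.size();
    }

    // Removes the oldest idle entry, if any. Called with the lock held.
    private boolean evictIdle() {
        Iterator<Map.Entry<String, Integer>> it = inUse.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() == 0) {
                it.remove();
                return true;
            }
        }
        return false;
    }
}
```

With a cache of size 1, a second acquire for a different node returns false after the timeout while the first proxy is still held, and succeeds once it is released.  Note that this only bounds the wait for a cache slot.  A caller that already holds a proxy can still block on the remote call itself, which is why the RPC layer may be the better place for the timeout.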

> ContainerManagementProtocolProxy can create thousands of threads for a large cluster
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-2314
>                 URL: https://issues.apache.org/jira/browse/YARN-2314
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 2.1.0-beta
>            Reporter: Jason Lowe
>            Priority: Critical
>         Attachments: nmproxycachefix.prototype.patch
>
>
> ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable.  However, on a large cluster the cache can grow far beyond the configured size and push the AM past its address/container limits.  More details in the first comment.

