hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
Date Thu, 17 Jul 2014 18:33:07 GMT

    [ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065316#comment-14065316 ]

Jason Lowe commented on YARN-2314:

The problem is that the cache doesn't try very hard to remove proxies when it is at or
beyond the maximum configured size.  When adding a new proxy would require removing an
entry, the cache simply grabs the least-recently-used proxy and tries to close it.  If
that entry is currently in use, nothing is removed immediately, which means we're
running with a cache larger than configured.

This can get far worse on a big cluster.  For example, if the least-recently-used proxy is
currently performing a call that is stuck on socket connection retries, the LRU entry could
take quite a while to close.  During that time each new proxy created will make the
same attempt to close that proxy and fail to do so.  When the LRU entry finally does close,
the cache is N-1 entries larger than it should be, where N is the number of proxies created
while the LRU entry was busy.
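
The growth described above can be reproduced with a small model.  This is only a sketch
under stated assumptions, not the Hadoop implementation: the class ProxyCacheSketch, its
fields, and the sizesAroundStall helper are all hypothetical names, and the rule that a
busy entry is removed as soon as it is released is an assumption made to keep the model
small.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified model of the behavior described above.
// None of these names come from the Hadoop source.
final class ProxyCacheSketch {

    /** Stand-in for a cached NM proxy. */
    static final class Proxy {
        final String node;
        boolean inUse;           // a call is currently in flight
        boolean closeRequested;  // eviction was attempted while in use

        Proxy(String node) { this.node = node; }
    }

    private final int maxSize;
    // accessOrder=true: iteration starts at the least-recently-used entry.
    private final LinkedHashMap<String, Proxy> cache =
        new LinkedHashMap<>(16, 0.75f, true);

    ProxyCacheSketch(int maxSize) {
        if (maxSize < 1) throw new IllegalArgumentException("maxSize must be >= 1");
        this.maxSize = maxSize;
    }

    Proxy getProxy(String node) {
        Proxy proxy = cache.get(node);
        if (proxy != null) return proxy;
        if (cache.size() >= maxSize) {
            // Single eviction attempt: only the LRU entry is ever considered.
            Iterator<Map.Entry<String, Proxy>> it = cache.entrySet().iterator();
            Proxy lru = it.next().getValue();
            if (lru.inUse) {
                lru.closeRequested = true;  // nothing removed; the cache grows
            } else {
                it.remove();
            }
        }
        proxy = new Proxy(node);
        cache.put(node, proxy);
        return proxy;
    }

    /** Assumed rule: a busy entry marked for close is dropped once its call ends. */
    void release(Proxy proxy) {
        proxy.inUse = false;
        if (proxy.closeRequested) cache.remove(proxy.node);
    }

    int size() { return cache.size(); }

    /** Returns {size while the LRU entry is stuck, size after it closes}. */
    static int[] sizesAroundStall(int maxSize, int createdWhileBusy) {
        ProxyCacheSketch sketch = new ProxyCacheSketch(maxSize);
        Proxy stuck = sketch.getProxy("node0");  // oldest entry, so it stays LRU
        stuck.inUse = true;
        for (int i = 1; i < maxSize; i++) sketch.getProxy("node" + i);
        for (int i = 0; i < createdWhileBusy; i++) sketch.getProxy("new" + i);
        int whileBusy = sketch.size();
        sketch.release(stuck);
        return new int[] { whileBusy, sketch.size() };
    }
}
```

With a configured size of 3 and N = 5 proxies created during the stall,
sizesAroundStall(3, 5) gives 8 entries while the LRU proxy is stuck and 7 once it closes,
which is N-1 = 4 over the configured size.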

On a large cluster with thousands of nodes, a proxy hanging on one node could allow the cache
to hold thousands more proxies than configured.  Since each proxy is a thread, that's
thousands of threads, and all those thread stacks can blow container limits on the AM (or
address space limits if it's a 32-bit AM).
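
To put rough numbers on the thread-stack concern, here is a back-of-the-envelope helper.
Both inputs are assumptions chosen for illustration: the per-thread stack size depends on
the JVM, the platform, and any -Xss setting, and the thread count is hypothetical.  The
class and method names are likewise made up for this sketch.

```java
// Rough estimate of address space reserved for thread stacks.
// Both arguments are caller-supplied assumptions, not measured values.
final class ThreadStackEstimate {

    /** Total stack reservation in MiB for the given thread count and stack size. */
    static long reservedMiB(long threads, long stackKiB) {
        return threads * stackKiB / 1024;
    }
}
```

For example, 4000 extra proxy threads at an assumed 512 KiB of stack each reserve
2000 MiB, or about 1.95 GiB, which is a large share of what a 32-bit process can address
at all.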

> ContainerManagementProtocolProxy can create thousands of threads for a large cluster
> ------------------------------------------------------------------------------------
>                 Key: YARN-2314
>                 URL: https://issues.apache.org/jira/browse/YARN-2314
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 2.1.0-beta
>            Reporter: Jason Lowe
>            Priority: Critical
> ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache
> is configurable.  However, the cache can grow far beyond the configured size when running on
> a large cluster and blow AM address/container limits.  More details in the first comment.
