hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
Date Tue, 22 Jul 2014 22:00:42 GMT

     [ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated YARN-2314:

    Attachment: nmproxycachefix.prototype.patch

I was thinking along similar lines, but I am worried about the corner case where all RPCs
are in use.  I think we need to handle this case even if it's rare.  An AM running on a node
where it can see the RM but has a network cut to the rest of the cluster could go bad very
quickly otherwise.  If we don't handle the corner case then we'll continue to grow the
proxy cache beyond its configured limit as we do today, and that AM will explode with
thousands of threads for what may be a temporary network outage.

While debugging this I wrote up a quick prototype patch to try to fix the cache so that it
stays under the configured limit.  Attaching the patch for reference.  However, as I
mentioned above, simply keeping the NM proxy cache under its configured limit means nothing
if we don't address the problems with connections remaining open in the IPC Client layer.
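
For illustration only, here is a minimal sketch of a proxy cache that stays under its limit
even when every cached proxy is in use.  This is not the code from the attached patch: the
class BoundedProxyCache, its acquire/release methods, and the wait timeout are all
hypothetical names and choices made for this example.  It evicts only idle entries, and when
all entries are busy it makes the caller wait (up to a timeout) instead of growing the cache.
It does nothing about connections that stay open in the IPC Client layer.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of a size-bounded proxy cache; not the YARN implementation. */
class BoundedProxyCache<P> {
  private static final class Entry<P> {
    final P proxy;
    int activeCallers;  // number of callers currently holding this proxy
    Entry(P proxy) { this.proxy = proxy; }
  }

  private final int maxSize;
  private final Function<String, P> factory;
  // accessOrder=true makes iteration start at the least recently used entry.
  private final Map<String, Entry<P>> entries = new LinkedHashMap<>(16, 0.75f, true);

  BoundedProxyCache(int maxSize, Function<String, P> factory) {
    this.maxSize = maxSize;
    this.factory = factory;
  }

  /** Returns a proxy for the node, waiting up to maxWaitMillis if every slot is busy. */
  synchronized P acquire(String nodeAddress, long maxWaitMillis) {
    long deadline = System.currentTimeMillis() + maxWaitMillis;
    Entry<P> e = entries.get(nodeAddress);
    while (e == null) {
      if (entries.size() < maxSize || evictIdleEntry()) {
        e = new Entry<>(factory.apply(nodeAddress));
        entries.put(nodeAddress, e);
      } else {
        // Corner case: cache is full and every proxy is in use.
        // Wait for a release instead of growing past the limit.
        long remaining = deadline - System.currentTimeMillis();
        if (remaining <= 0) {
          throw new IllegalStateException("all " + maxSize + " cached proxies are in use");
        }
        try {
          wait(remaining);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException("interrupted while waiting for a proxy slot", ie);
        }
        e = entries.get(nodeAddress);  // another thread may have created it meanwhile
      }
    }
    e.activeCallers++;
    return e.proxy;
  }

  /** Marks one use of the node's proxy as finished and wakes waiters if it became idle. */
  synchronized void release(String nodeAddress) {
    Entry<P> e = entries.get(nodeAddress);
    if (e != null && e.activeCallers > 0 && --e.activeCallers == 0) {
      notifyAll();
    }
  }

  /** Removes the least recently used idle entry; returns false if none is idle. */
  private boolean evictIdleEntry() {
    Iterator<Entry<P>> it = entries.values().iterator();
    while (it.hasNext()) {
      if (it.next().activeCallers == 0) {
        it.remove();  // a real cache would also close the evicted proxy here
        return true;
      }
    }
    return false;
  }

  synchronized int size() {
    return entries.size();
  }
}
```

With a limit of 2 and two proxies held, a third acquire for a different node fails after the
timeout rather than adding a third entry; once one holder calls release, the same acquire
succeeds by evicting the idle entry.  Whether to block, fail fast, or do something else in
the all-in-use case is a design decision this sketch does not settle.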

> ContainerManagementProtocolProxy can create thousands of threads for a large cluster
> ------------------------------------------------------------------------------------
>                 Key: YARN-2314
>                 URL: https://issues.apache.org/jira/browse/YARN-2314
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 2.1.0-beta
>            Reporter: Jason Lowe
>            Priority: Critical
>         Attachments: nmproxycachefix.prototype.patch
> ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache
> is configurable.  However the cache can grow far beyond the configured size when running on
> a large cluster and blow AM address/container limits.  More details in the first comment.

This message was sent by Atlassian JIRA