hadoop-yarn-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4697) NM aggregation thread pool is not bound by limits
Date Thu, 21 Apr 2016 16:31:25 GMT

    [ https://issues.apache.org/jira/browse/YARN-4697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252164#comment-15252164
] 

Junping Du commented on YARN-4697:
----------------------------------

bq. My concern is that if we don't fix the root cause, though we've protected ourselves from
crashes, we'd just be queueing a lot of aggregation processes and causing long waiting times.
Agreed. We do see the NM log aggregation service launch many active threads, which hold a large
number of TCP connections to DNs and use up the system's open-file limit. We can bound the
shared thread pool here, but the TCP connection problem may not be solved by this patch.

bq. Upon NM restart, NM will try to recover all applications and submit a log aggregation
task to the thread pool for each application recovered. Therefore, a large number of recovered
applications plus concurrent applications can cause the thread pool to increase without a
bound.
Are all of these applications active, or have some already finished? I suspect we are leaking
finished applications in the NM state store during recovery. I noticed this issue when filing
YARN-4325 but lost my progress because the previous long-running cluster is gone. [~haibochen],
could you check whether your case is the same?

In general, I think the fix in this JIRA is OK. But I agree with Vinod that we should dig
further into the root cause; otherwise there could be other holes (like the TCP connection
leak mentioned above).

> NM aggregation thread pool is not bound by limits
> -------------------------------------------------
>
>                 Key: YARN-4697
>                 URL: https://issues.apache.org/jira/browse/YARN-4697
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>            Reporter: Haibo Chen
>            Assignee: Haibo Chen
>            Priority: Critical
>             Fix For: 2.9.0
>
>         Attachments: yarn4697.001.patch, yarn4697.002.patch, yarn4697.003.patch, yarn4697.004.patch
>
>
> In LogAggregationService.java we create a thread pool to upload logs from the NodeManager
to HDFS if log aggregation is turned on. This is a cached thread pool which, according to the
javadoc, is an unlimited pool of threads.
> If we have had a problem with log aggregation, this could cause a problem on restart. The
number of threads created at that point could be huge and would put a large load on the
NameNode; in the worst case it could even bring it down due to file descriptor issues.
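
The difference between an unbounded cached pool and a bounded one can be sketched in isolation. This is only an illustration, not the attached patch: the class and method names are hypothetical, the sleeping tasks stand in for per-application log uploads, and the pool size of 10 is an arbitrary assumption. It relies only on the standard java.util.concurrent executors.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; not code from LogAggregationService.
public class AggregationPoolSketch {

    // Submits `tasks` short sleeping jobs (stand-ins for log uploads) and
    // returns the highest number of jobs that were running at the same time.
    static int peakConcurrency(ExecutorService pool, int tasks) {
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(tasks);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                int now = running.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(50); // simulated upload work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    running.decrementAndGet();
                    done.countDown();
                }
            });
        }
        try {
            done.await(); // wait until every submitted job has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    // Cached pool: creates a new thread whenever no idle thread is available.
    static int peakWithCachedPool(int tasks) {
        return peakConcurrency(Executors.newCachedThreadPool(), tasks);
    }

    // Fixed pool: at most `poolSize` threads; extra jobs wait in the queue.
    static int peakWithFixedPool(int poolSize, int tasks) {
        return peakConcurrency(Executors.newFixedThreadPool(poolSize), tasks);
    }

    public static void main(String[] args) {
        System.out.println("cached pool peak: " + peakWithCachedPool(200));
        System.out.println("fixed pool peak:  " + peakWithFixedPool(10, 200));
    }
}
```

With the fixed pool the peak can never exceed the configured size, and the remaining jobs wait in the queue. That matches the trade-off raised in the comment above: the node is protected from thread and file descriptor exhaustion, but a large backlog only means longer waiting times unless the root cause is also addressed.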



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
