hadoop-yarn-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4697) NM aggregation thread pool is not bound by limits
Date Thu, 21 Apr 2016 19:27:25 GMT

    [ https://issues.apache.org/jira/browse/YARN-4697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252503#comment-15252503 ]

Junping Du commented on YARN-4697:

After investigating a cluster of ours that also hit this issue recently, I think there are two root causes here:
1. Due to YARN-4325, too many stale applications do not get purged from the NM state store. The NM then recovers all of these stale applications, and each recovered app initializes log aggregation.
2. Because these applications are stale, some operations, such as createAppDir(), fail with token issues, but we swallow the exception there and continue to create an invalid aggregator. I just filed YARN-4984 to fix this issue.
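The swallowed-exception pattern in point 2 can be sketched as below. The class and method names here are illustrative stand-ins, not the actual NM code; the real createAppDir() lives in LogAggregationService and fails on token validation for stale apps.

```java
import java.io.IOException;

public class AppDirInit {
    // Hypothetical stand-in for the NM's per-app log-dir setup; an empty
    // appId simulates the token failure seen for stale applications.
    static void createAppDir(String appId) throws IOException {
        if (appId == null || appId.isEmpty()) {
            throw new IOException("cannot create log dir for stale app");
        }
    }

    // Before YARN-4984 (sketch): swallow the failure and keep going,
    // registering an aggregator for an app that can never upload logs.
    static boolean initAggregatorLenient(String appId) {
        try {
            createAppDir(appId);
        } catch (IOException e) {
            // exception swallowed; aggregator is still created below
        }
        return true; // invalid aggregator created regardless
    }

    // After (sketch): let the failure abort initialization so no
    // invalid aggregator is registered for the stale application.
    static boolean initAggregatorStrict(String appId) {
        try {
            createAppDir(appId);
        } catch (IOException e) {
            return false; // abort: no aggregator for this app
        }
        return true;
    }
}
```

With many stale apps recovered at once, the lenient variant multiplies invalid aggregators, which feeds directly into the thread-pool problem described below in this issue.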

> NM aggregation thread pool is not bound by limits
> -------------------------------------------------
>                 Key: YARN-4697
>                 URL: https://issues.apache.org/jira/browse/YARN-4697
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>            Reporter: Haibo Chen
>            Assignee: Haibo Chen
>            Priority: Critical
>             Fix For: 2.9.0
>         Attachments: yarn4697.001.patch, yarn4697.002.patch, yarn4697.003.patch, yarn4697.004.patch
> In LogAggregationService.java we create a thread pool to upload logs from the NodeManager to HDFS when log aggregation is turned on. This is a cached thread pool, which, per the javadoc, is an unbounded pool of threads.
> If log aggregation has been failing, this can cause a problem on restart: the number of threads created at that point could be huge, putting a large load on the NameNode and, in the worst case, even bringing it down due to file descriptor exhaustion.
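The bounded-pool fix the attached patches aim at can be sketched as follows. This is a minimal illustration of the cached-vs-fixed pool distinction, not the actual patch; the pool-size constant is a hypothetical placeholder for whatever configuration limit the patch introduces.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

public class UploaderPoolSketch {
    // Hypothetical default cap, for illustration only.
    static final int DEFAULT_POOL_SIZE_MAX = 100;

    // Unbounded: newCachedThreadPool() reports Integer.MAX_VALUE as its
    // maximum size, so a burst of recovered apps spawns a thread each.
    static ThreadPoolExecutor unboundedPool() {
        return (ThreadPoolExecutor) Executors.newCachedThreadPool();
    }

    // Bounded: a fixed-size pool caps concurrent uploads; excess app
    // log-aggregation tasks queue instead of creating new threads.
    static ThreadPoolExecutor boundedPool(int max) {
        return (ThreadPoolExecutor) Executors.newFixedThreadPool(max);
    }

    public static void main(String[] args) {
        ThreadPoolExecutor cached = unboundedPool();
        ThreadPoolExecutor fixed = boundedPool(DEFAULT_POOL_SIZE_MAX);
        System.out.println(cached.getMaximumPoolSize()); // Integer.MAX_VALUE
        System.out.println(fixed.getMaximumPoolSize());
        cached.shutdown();
        fixed.shutdown();
    }
}
```

Capping the pool bounds the number of simultaneous HDFS writers per NM, so a restart with a large backlog of apps no longer translates into an equally large spike of NameNode connections.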

This message was sent by Atlassian JIRA
