Date: Wed, 20 Jul 2016 12:10:20 +0000 (UTC)
From: "Weiwei Yang (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-5309) Fix SSLFactory truststore reloader thread leak in TimelineClientImpl

    [ https://issues.apache.org/jira/browse/YARN-5309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385744#comment-15385744 ]

Weiwei Yang commented on YARN-5309:
-----------------------------------

[~vvasudev] Thanks a lot for all your help!

> Fix SSLFactory truststore reloader thread leak in TimelineClientImpl
> --------------------------------------------------------------------
>
>                 Key: YARN-5309
>                 URL: https://issues.apache.org/jira/browse/YARN-5309
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: timelineserver, yarn
>    Affects Versions: 2.7.1
>            Reporter: Thomas Friedrich
>            Assignee: Weiwei Yang
>            Priority: Blocker
>             Fix For: 2.7.3
>
>         Attachments: YARN-5309.001.patch, YARN-5309.002.patch, YARN-5309.003.patch, YARN-5309.004.patch, YARN-5309.005.patch, YARN-5309.branch-2.7.3.001.patch, YARN-5309.branch-2.8.001.patch
>
>
> We found a similar issue as HADOOP-11368 in TimelineClientImpl. The class creates an instance of SSLFactory in newSslConnConfigurator and subsequently creates the ReloadingX509TrustManager instance which in turn starts a trust store reloader thread.
> However, the SSLFactory is never destroyed and hence the trust store reloader threads are not killed.
> This problem was observed by a customer who had SSL enabled in Hadoop and submitted many queries against the HiveServer2.
> After a few days, the HS2 instance crashed and from the Java dump we could see many (over 13000) threads like this:
> "Truststore reloader thread" #126 daemon prio=5 os_prio=0 tid=0x00007f680d2e3000 nid=0x98fd waiting on condition [0x00007f67e482c000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.security.ssl.ReloadingX509TrustManager.run(ReloadingX509TrustManager.java:225)
>         at java.lang.Thread.run(Thread.java:745)
> HiveServer2 uses the JobClient to submit a job:
> Thread [HiveServer2-Background-Pool: Thread-188] (Suspended (breakpoint at line 89 in ReloadingX509TrustManager))
>         owns: Object (id=464)
>         owns: Object (id=465)
>         owns: Object (id=466)
>         owns: ServiceLoader (id=210)
>         ReloadingX509TrustManager.<init>(String, String, String, long) line: 89
>         FileBasedKeyStoresFactory.init(SSLFactory$Mode) line: 209
>         SSLFactory.init() line: 131
>         TimelineClientImpl.newSslConnConfigurator(int, Configuration) line: 532
>         TimelineClientImpl.newConnConfigurator(Configuration) line: 507
>         TimelineClientImpl.serviceInit(Configuration) line: 269
>         TimelineClientImpl(AbstractService).init(Configuration) line: 163
>         YarnClientImpl.serviceInit(Configuration) line: 169
>         YarnClientImpl(AbstractService).init(Configuration) line: 163
>         ResourceMgrDelegate.serviceInit(Configuration) line: 102
>         ResourceMgrDelegate(AbstractService).init(Configuration) line: 163
>         ResourceMgrDelegate.<init>(YarnConfiguration) line: 96
>         YARNRunner.<init>(Configuration) line: 112
>         YarnClientProtocolProvider.create(Configuration) line: 34
>         Cluster.initialize(InetSocketAddress, Configuration) line: 95
>         Cluster.<init>(InetSocketAddress, Configuration) line: 82
>         Cluster.<init>(Configuration) line: 75
>         JobClient.init(JobConf) line: 475
>         JobClient.<init>(JobConf) line: 454
>         MapRedTask(ExecDriver).execute(DriverContext) line: 401
>         MapRedTask.execute(DriverContext) line: 137
>         MapRedTask(Task).executeTask() line: 160
>         TaskRunner.runSequential() line: 88
>         Driver.launchTask(Task, String, boolean, String, int, DriverContext) line: 1653
>         Driver.execute() line: 1412
> For every job, a new instance of JobClient/YarnClientImpl/TimelineClientImpl is created. But because the HS2 process stays up for days, the previous trust store reloader threads are still hanging around in the HS2 process and eventually use all the resources available.
> It seems like a similar fix as HADOOP-11368 is needed in TimelineClientImpl but it doesn't have a destroy method to begin with.
> One option to avoid this problem is to disable the yarn timeline service (yarn.timeline-service.enabled=false).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
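
The leak pattern described in the quoted issue can be illustrated with a small, self-contained Java sketch. It does not use Hadoop classes and is not the patch attached to YARN-5309. ReloadingFactory, Client, and runJobs are hypothetical names made up for illustration: ReloadingFactory stands in for a factory that starts a background reloader thread when created, and Client stands in for a per-job client that creates such a factory in init(). The assumption shown is that the client keeps a reference to the factory and destroys it when the client is stopped, so the reloader thread ends with the client instead of outliving it.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ReloaderLeakSketch {

    /** Hypothetical stand-in for a factory that owns a background reloader thread. */
    static final class ReloadingFactory {
        // Counts reloader threads that have been created and have not yet exited.
        static final AtomicInteger LIVE_THREADS = new AtomicInteger();

        private volatile boolean running = true;
        private final Thread reloader;

        ReloadingFactory() {
            LIVE_THREADS.incrementAndGet();
            reloader = new Thread(() -> {
                try {
                    while (running) {
                        Thread.sleep(10_000); // pretend to poll the trust store file
                    }
                } catch (InterruptedException e) {
                    // Interrupted by destroy(): fall through and let the thread exit.
                } finally {
                    LIVE_THREADS.decrementAndGet();
                }
            }, "Truststore reloader thread (sketch)");
            reloader.setDaemon(true);
            reloader.start();
        }

        /** Stops the reloader thread and waits until it has exited. */
        void destroy() throws InterruptedException {
            running = false;
            reloader.interrupt();
            reloader.join();
        }
    }

    /** Hypothetical per-job client: keeps the factory so stop() can release it. */
    static final class Client {
        private ReloadingFactory factory;

        void init() {
            factory = new ReloadingFactory();
        }

        void stop() throws InterruptedException {
            if (factory != null) {
                factory.destroy();
                factory = null;
            }
        }
    }

    /** Runs the given number of jobs and returns how many reloader threads were left alive. */
    static int runJobs(int jobs, boolean stopClients) throws InterruptedException {
        int before = ReloadingFactory.LIVE_THREADS.get();
        for (int i = 0; i < jobs; i++) {
            Client client = new Client();
            client.init();
            if (stopClients) {
                client.stop();
            }
        }
        return ReloadingFactory.LIVE_THREADS.get() - before;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("left alive without stop(): " + runJobs(50, false));
        System.out.println("left alive with stop():    " + runJobs(50, true));
    }
}
```

In this sketch, every job that skips stop() leaves one sleeping daemon thread behind, which mirrors the growth seen in the thread dump of a long-running process. Whether a real client can be fixed the same way depends on it having a stop or destroy hook in which the factory can be released.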