hadoop-yarn-issues mailing list archives

From "Thomas Friedrich (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-5309) SSLFactory truststore reloader thread leak in TimelineClientImpl
Date Thu, 07 Jul 2016 01:23:11 GMT

    [ https://issues.apache.org/jira/browse/YARN-5309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15365451#comment-15365451 ]

Thomas Friedrich commented on YARN-5309:

Hi [~cheersyang], Hive still depends on an older version of Hadoop whose JobClient does not
implement the AutoCloseable interface. In addition, I found that without MAPREDUCE-6618, calling
close on the JobClient won't do anything either (MAPREDUCE-6618 is only part of Hadoop 2.6.4
and 2.7.3). And when debugging, I found that Hive didn't call close on the JobClient to begin
with. I will test a newer Hadoop with your patch and additional changes in Hive to confirm that
your patch works for Hive, and then open another Hive JIRA linking to this one for the Hive changes.
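
To illustrate the kind of Hive-side change mentioned above, here is a minimal sketch of closing a client that has a close method but does not implement AutoCloseable, so try-with-resources is not available. LegacyClientSketch and CloseDemo are hypothetical stand-ins, not Hadoop or Hive classes. Whether closing a real JobClient actually releases the reloader thread depends on MAPREDUCE-6618 and the patch on this issue, as described above.

```java
/** Hypothetical stand-in for a client on an older API that lacks AutoCloseable. */
final class LegacyClientSketch {
    private boolean closed;

    String submit(String job) {
        if (closed) {
            throw new IllegalStateException("client already closed");
        }
        return "submitted:" + job;
    }

    /** Releases whatever the client holds; here it only flips a flag. */
    void close() {
        closed = true;
    }

    boolean isClosed() {
        return closed;
    }
}

final class CloseDemo {
    /** Kept only so the sketch can show that the client really was closed. */
    static LegacyClientSketch lastClient;

    static String runJob(String job) {
        LegacyClientSketch client = new LegacyClientSketch();
        lastClient = client;
        try {
            return client.submit(job);
        } finally {
            // Runs on both success and failure, so one client per job
            // does not leave resources behind in a long-lived process.
            client.close();
        }
    }

    public static void main(String[] args) {
        System.out.println(runJob("query-1"));
        System.out.println("closed: " + lastClient.isClosed());
    }
}
```

The try/finally form is used because it works regardless of whether the client type implements AutoCloseable.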

> SSLFactory truststore reloader thread leak in TimelineClientImpl
> ----------------------------------------------------------------
>                 Key: YARN-5309
>                 URL: https://issues.apache.org/jira/browse/YARN-5309
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: timelineserver, yarn
>    Affects Versions: 2.7.1
>            Reporter: Thomas Friedrich
>            Assignee: Weiwei Yang
>         Attachments: YARN-5309.001.patch, YARN-5309.002.patch
> We found an issue similar to HADOOP-11368 in TimelineClientImpl. The class creates an instance of SSLFactory in newSslConnConfigurator and subsequently creates the ReloadingX509TrustManager instance, which in turn starts a trust store reloader thread.
> However, the SSLFactory is never destroyed and hence the trust store reloader threads are not killed.
> This problem was observed by a customer who had SSL enabled in Hadoop and submitted many queries against HiveServer2. After a few days, the HS2 instance crashed, and in the Java dump we could see many (over 13000) threads like this:
> "Truststore reloader thread" #126 daemon prio=5 os_prio=0 tid=0x00007f680d2e3000 nid=0x98fd waiting on condition [0x00007f67e482c000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.security.ssl.ReloadingX509TrustManager.run(ReloadingX509TrustManager.java:225)
>         at java.lang.Thread.run(Thread.java:745)
> HiveServer2 uses the JobClient to submit a job:
> Thread [HiveServer2-Background-Pool: Thread-188] (Suspended (breakpoint at line 89 in ReloadingX509TrustManager))
> 	owns: Object  (id=464)	
> 	owns: Object  (id=465)	
> 	owns: Object  (id=466)	
> 	owns: ServiceLoader<S>  (id=210)	
> 	ReloadingX509TrustManager.<init>(String, String, String, long) line: 89	
> 	FileBasedKeyStoresFactory.init(SSLFactory$Mode) line: 209	
> 	SSLFactory.init() line: 131	
> 	TimelineClientImpl.newSslConnConfigurator(int, Configuration) line: 532	
> 	TimelineClientImpl.newConnConfigurator(Configuration) line: 507	
> 	TimelineClientImpl.serviceInit(Configuration) line: 269	
> 	TimelineClientImpl(AbstractService).init(Configuration) line: 163	
> 	YarnClientImpl.serviceInit(Configuration) line: 169	
> 	YarnClientImpl(AbstractService).init(Configuration) line: 163	
> 	ResourceMgrDelegate.serviceInit(Configuration) line: 102	
> 	ResourceMgrDelegate(AbstractService).init(Configuration) line: 163	
> 	ResourceMgrDelegate.<init>(YarnConfiguration) line: 96	
> 	YARNRunner.<init>(Configuration) line: 112	
> 	YarnClientProtocolProvider.create(Configuration) line: 34	
> 	Cluster.initialize(InetSocketAddress, Configuration) line: 95	
> 	Cluster.<init>(InetSocketAddress, Configuration) line: 82	
> 	Cluster.<init>(Configuration) line: 75	
> 	JobClient.init(JobConf) line: 475	
> 	JobClient.<init>(JobConf) line: 454	
> 	MapRedTask(ExecDriver).execute(DriverContext) line: 401	
> 	MapRedTask.execute(DriverContext) line: 137	
> 	MapRedTask(Task<T>).executeTask() line: 160	
> 	TaskRunner.runSequential() line: 88	
> 	Driver.launchTask(Task<Serializable>, String, boolean, String, int, DriverContext) line: 1653	
> 	Driver.execute() line: 1412	
> For every job, a new instance of JobClient/YarnClientImpl/TimelineClientImpl is created. But because the HS2 process stays up for days, the trust store reloader threads from earlier jobs remain in the HS2 process and eventually use all the resources available.
> It seems a fix similar to HADOOP-11368 is needed in TimelineClientImpl, but it doesn't have a destroy method to begin with.
> One option to avoid this problem is to disable the YARN timeline service (yarn.timeline-service.enabled=false).
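
The leak pattern described in the issue can be reproduced in miniature. The sketch below is not the Hadoop code: ReloaderSketch, ClientSketch, and LeakDemo are hypothetical names standing in for ReloadingX509TrustManager and the per-job client chain. It assumes only what the description states, namely that each client starts a sleeping background thread and that the thread goes away only if something explicitly destroys it.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a trust manager that polls on a background thread. */
final class ReloaderSketch implements Runnable {
    /** Number of reloader threads started and not yet finished. */
    static final AtomicInteger LIVE_THREADS = new AtomicInteger();

    private final Thread reloader;
    private volatile boolean running = true;

    ReloaderSketch() {
        reloader = new Thread(this, "Truststore reloader thread (sketch)");
        reloader.setDaemon(true); // daemon, like the thread in the dump
    }

    void init() {
        LIVE_THREADS.incrementAndGet();
        reloader.start();
    }

    @Override
    public void run() {
        try {
            while (running) {
                Thread.sleep(10_000); // stands in for the reload interval
            }
        } catch (InterruptedException e) {
            // Interrupted by destroy(): fall through and exit.
        } finally {
            LIVE_THREADS.decrementAndGet();
        }
    }

    /** Stops the background thread and waits for it to exit. */
    void destroy() {
        running = false;
        reloader.interrupt();
        try {
            reloader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

/** Hypothetical per-job client that owns one reloader. */
final class ClientSketch {
    private final ReloaderSketch trust = new ReloaderSketch();

    void init() {
        trust.init();
    }

    void stop() {
        trust.destroy();
    }
}

final class LeakDemo {
    /** Creates one client per job; returns how many reloader threads are left behind. */
    static int leakedAfter(int jobs, boolean stopClients) {
        int before = ReloaderSketch.LIVE_THREADS.get();
        for (int i = 0; i < jobs; i++) {
            ClientSketch client = new ClientSketch();
            client.init();
            if (stopClients) {
                client.stop();
            }
        }
        return ReloaderSketch.LIVE_THREADS.get() - before;
    }

    public static void main(String[] args) {
        System.out.println("left behind without stop: " + leakedAfter(5, false));
        System.out.println("left behind with stop:    " + leakedAfter(5, true));
    }
}
```

Because the threads are daemon threads, they do not keep the JVM alive, but in a process that stays up for days they accumulate one per job, which matches the thousands of identical threads seen in the dump.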

