tez-issues mailing list archives

From "Jonathan Turner Eagles (Jira)" <j...@apache.org>
Subject [jira] [Commented] (TEZ-4070) SSLFactory not closed in DAGClientTimelineImpl caused native memory issues
Date Fri, 31 Jan 2020 21:25:00 GMT

    [ https://issues.apache.org/jira/browse/TEZ-4070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027843#comment-17027843 ]

Jonathan Turner Eagles commented on TEZ-4070:

I'll put this in my queue to review. Thanks for supplying a patch.

> SSLFactory not closed in DAGClientTimelineImpl caused native memory issues
> --------------------------------------------------------------------------
>                 Key: TEZ-4070
>                 URL: https://issues.apache.org/jira/browse/TEZ-4070
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Xun REN
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: TEZ-4070.01.patch, TEZ-4070.02.patch, TEZ-4070.03.patch
> Hi,
> Recently, we have been facing native memory issues on Red Hat servers. They crashed
> our servers completely.
> *Context:*
> - HDP-2.6.5 
> - Redhat 7.4
> *Problem:*
> Since upgrading from HDP-2.6.2 to HDP-2.6.5, our HiveServer2 can consume more than
> 100GB of memory after running for several days, even though we have configured
> Xmx20G and set MaxMetaspace to 10GB.
> After searching, we found a similar issue here:
> https://issues.apache.org/jira/browse/YARN-5309
> It was fixed in the hadoop-common module. Our version already includes this fix;
> however, we still have the problem.
> After further searching, I found that the Tez class
> org.apache.tez.dag.api.client.TimelineReaderFactory creates an SSLFactory when HTTPS
> is used for YARN, and this SSLFactory is not destroyed afterwards.
> TimelineReaderFactory is referenced in the class DAGClientTimelineImpl.
> If ATS is used and the DAG has completed, the method switchToTimelineClient in the
> class DAGClientImpl is called. It closes the previous HTTP client, but not the
> SSLFactory inside it. The SSLFactory creates a thread for each connection, so we
> eventually end up with thousands of threads consuming a lot of native memory.
> {code:java}
> private void switchToTimelineClient() throws IOException, TezException {
>   realClient.close();
>   realClient = new DAGClientTimelineImpl(appId, dagId, conf, frameworkClient,
>   if (LOG.isDebugEnabled()) {
>     LOG.debug("dag completed switching to DAGClientTimelineImpl");
>   }
> }{code}
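
A minimal, self-contained sketch of the ownership problem described above. It is not the Tez code and not the attached patch: ReloadingFactory and TimelineClientSketch are hypothetical stand-ins for an SSL factory that owns a background reloader thread and for a client that creates such a factory. It only illustrates that a close() which does not also destroy the owned factory leaves one thread behind per client.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class OwnedResourceCloseSketch {

    /** Hypothetical stand-in for a factory that owns a background reloader thread. */
    static final class ReloadingFactory {
        static final AtomicInteger LIVE_THREADS = new AtomicInteger();
        private final Thread reloader;

        ReloadingFactory() {
            reloader = new Thread(() -> {
                try {
                    while (true) {
                        Thread.sleep(50); // pretend to re-check the truststore
                    }
                } catch (InterruptedException e) {
                    // destroy() interrupted us: fall through and exit
                } finally {
                    LIVE_THREADS.decrementAndGet();
                }
            }, "Truststore reloader thread");
            reloader.setDaemon(true);
            LIVE_THREADS.incrementAndGet();
            reloader.start();
        }

        /** Stops the reloader thread and waits until it has exited. */
        void destroy() {
            reloader.interrupt();
            try {
                reloader.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Hypothetical client that creates, and therefore owns, its factory. */
    static final class TimelineClientSketch implements AutoCloseable {
        private final ReloadingFactory factory = new ReloadingFactory();
        private final boolean destroyFactoryOnClose;

        TimelineClientSketch(boolean destroyFactoryOnClose) {
            this.destroyFactoryOnClose = destroyFactoryOnClose;
        }

        @Override
        public void close() {
            // Closing the HTTP connection itself would happen here.
            if (destroyFactoryOnClose) {
                factory.destroy(); // the step that is missing in the leaking variant
            }
        }
    }

    /** Opens and closes the given number of clients; returns how many reloader threads remain. */
    public static int liveAfter(int clients, boolean destroyFactoryOnClose) {
        int before = ReloadingFactory.LIVE_THREADS.get();
        for (int i = 0; i < clients; i++) {
            try (TimelineClientSketch client = new TimelineClientSketch(destroyFactoryOnClose)) {
                // a real client would query the timeline server here
            }
        }
        return ReloadingFactory.LIVE_THREADS.get() - before;
    }

    public static void main(String[] args) {
        System.out.println("leaked without destroy: " + liveAfter(5, false));
        System.out.println("leaked with destroy:    " + liveAfter(5, true));
    }
}
```

In this sketch the first call leaves five reloader threads running and the second leaves none, which mirrors the one-thread-per-connection growth reported for HiveServer2.
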
> I also checked another environment that is still on HDP-2.6.2, and it too has a lot
> of running threads held by SSLFactory. That means the problem is amplified in the
> newer version.
> *How to reproduce the problem:*
> 1. Use Tez as Hive execution engine
> 2. Launch a Beeline session for Hive
> 3. Do a select with a simple where clause on a table
> 4. Repeat steps 2-3 to open different connections (as happens on a shared
> cluster with multiple clients).
> Finally, the thread dump shows a lot of threads named "Truststore reloader thread",
> and the native memory usage reported by "top" or "ps" is very high.
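
For the thread dump check in the last step, a small helper along these lines could count threads per name. This is only a sketch: it assumes a HotSpot-style dump (for example from jstack) in which each thread header line begins with the thread name in double quotes. The class name ThreadDumpNameCount and the sample lines are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ThreadDumpNameCount {

    // Assumption: thread header lines start with the quoted thread name.
    private static final Pattern HEADER = Pattern.compile("^\"([^\"]+)\"");

    // Made-up sample used when no dump file is passed on the command line.
    private static final List<String> SAMPLE = Arrays.asList(
        "\"Truststore reloader thread\" #101 daemon prio=5 os_prio=0 tid=0x1 nid=0x1 waiting on condition",
        "   java.lang.Thread.State: TIMED_WAITING (sleeping)",
        "\"Truststore reloader thread\" #102 daemon prio=5 os_prio=0 tid=0x2 nid=0x2 waiting on condition",
        "   java.lang.Thread.State: TIMED_WAITING (sleeping)",
        "\"main\" #1 prio=5 os_prio=0 tid=0x3 nid=0x3 runnable");

    /** Counts how often each thread name occurs in the given dump lines. */
    public static Map<String, Integer> countNames(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = HEADER.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = args.length > 0
            ? Files.readAllLines(Paths.get(args[0]))
            : SAMPLE;
        // Print "<count><TAB><thread name>" for every distinct name.
        countNames(lines).forEach((name, n) -> System.out.println(n + "\t" + name));
    }
}
```

Run against a saved dump file, a count for "Truststore reloader thread" that keeps growing between dumps would be consistent with the leak described in the report.
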

This message was sent by Atlassian Jira
