hive-dev mailing list archives

From "Misha Dmitriev (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-16079) HS2: high memory pressure due to duplicate Properties objects
Date Wed, 01 Mar 2017 20:27:45 GMT
Misha Dmitriev created HIVE-16079:
-------------------------------------

             Summary: HS2: high memory pressure due to duplicate Properties objects
                 Key: HIVE-16079
                 URL: https://issues.apache.org/jira/browse/HIVE-16079
             Project: Hive
          Issue Type: Improvement
          Components: HiveServer2
            Reporter: Misha Dmitriev
            Assignee: Misha Dmitriev
         Attachments: hs2-crash-2000p-500m-50q.txt

I've created a Hive table with 2000 partitions, each backed by two files, with one row in
each file. When I execute some number of concurrent queries against this table, e.g. as follows

{code}
for i in `seq 1 50`; do
  beeline -u jdbc:hive2://localhost:10000 -n admin -p admin \
    -e "select count(i_f_1) from misha_table;" &
done
{code}

it results in a big memory spike. With 20 queries I caused an OOM in an HS2 server running
with -Xmx200m, and with 50 queries in one running with -Xmx500m.

I am attaching the results of a jxray (www.jxray.com) analysis of a heap dump that was generated
in the 50-query/500m-heap scenario. It suggests that there are several opportunities to reduce
memory pressure with fairly non-invasive changes to the code:

1. 24.5% of memory is wasted by duplicate strings (see section 6). With String.intern() calls
added in the ~10 relevant places in the code, this overhead can be greatly reduced.

2. Almost 20% of memory is wasted by various suboptimally used collections (see section
8). There are many maps and lists that are either empty or have just 1 element. By modifying
the code that creates and populates these collections, we can likely save 5-10% of memory.

3. Almost 20% of memory is used by instances of java.util.Properties. It looks like these
objects are heavily duplicated, since for each partition each concurrently running query creates
its own copy of Partition, PartitionDesc and Properties. Thus we have nearly 100,000 (50 queries
* 2,000 partitions) Properties objects in memory. By interning/deduplicating these objects we may
be able to save perhaps 15% of memory.
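
To illustrate items 1 and 3, here is a minimal sketch of what interning Properties could look
like. It is not taken from the Hive code base: the class name PropertiesInterner and its
intern() method are hypothetical. It assumes that the interned objects are treated as read-only
by all callers, that only String keys and values matter (defaults are flattened into the copy),
and that a strongly referenced pool is acceptable. A real fix would probably need weak
references so that unused entries can be collected.

```java
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper (not part of Hive): one shared instance per distinct content. */
final class PropertiesInterner {
    // Pool of canonical instances. Properties.equals()/hashCode() are content-based,
    // so equal-content objects map to the same pool entry.
    private static final ConcurrentHashMap<Properties, Properties> POOL =
        new ConcurrentHashMap<>();

    private PropertiesInterner() {}

    /**
     * Returns a canonical, shared Properties with the same String entries as props.
     * The result must be treated as read-only: mutating it would affect every
     * holder and corrupt the pool lookup.
     */
    static Properties intern(Properties props) {
        if (props == null) {
            return null;
        }
        // Build a private copy whose keys and values are interned strings (item 1).
        // stringPropertyNames() also covers entries inherited from defaults.
        Properties copy = new Properties();
        for (String name : props.stringPropertyNames()) {
            copy.setProperty(name.intern(), props.getProperty(name).intern());
        }
        // Publish the copy unless an equal one is already pooled (item 3).
        Properties existing = POOL.putIfAbsent(copy, copy);
        return existing != null ? existing : copy;
    }
}
```

With something like this, each of the 50 queries would still build its own Properties for a
partition, but after calling intern() they would all hold a reference to the same object, and
the per-query copies would become short-lived garbage.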

So overall, I think there is a good chance to reduce HS2 memory consumption in this scenario
by ~40%.
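
For item 2 above, one possible approach is sketched below. The helper CompactLists is
hypothetical and not existing Hive code. It assumes that the lists in question are no longer
modified once populated, because the empty and single-element results are immutable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper (not part of Hive): cheaper representations for small lists. */
final class CompactLists {
    private CompactLists() {}

    /**
     * Returns a memory-lean equivalent of the given list:
     * a shared immutable empty list, an immutable one-element list,
     * or a copy whose backing array is sized to the current contents.
     */
    static <T> List<T> compact(List<T> list) {
        if (list == null || list.isEmpty()) {
            return Collections.emptyList();      // shared singleton, no per-instance cost
        }
        if (list.size() == 1) {
            return Collections.singletonList(list.get(0));  // no backing array at all
        }
        return new ArrayList<>(list);            // no spare capacity
    }
}
```

The same idea applies to maps via Collections.emptyMap() and Collections.singletonMap(). An
alternative is to allocate the collection lazily, only when the first element is added.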




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)