hive-dev mailing list archives

From "Misha Dmitriev (JIRA)" <>
Subject [jira] [Created] (HIVE-16489) HMS wastes 26.4% of memory due to dup strings in metastore.api.Partition.parameters
Date Thu, 20 Apr 2017 23:43:04 GMT
Misha Dmitriev created HIVE-16489:

             Summary: HMS wastes 26.4% of memory due to dup strings in metastore.api.Partition.parameters
                 Key: HIVE-16489
             Project: Hive
          Issue Type: Improvement
          Components: HiveServer2
            Reporter: Misha Dmitriev
            Assignee: Misha Dmitriev

I've created a Hive table with 2000 partitions, each backed by two files, with one row in
each file. When I run a number of concurrent queries against this table, for example:

for i in `seq 1 50`; do beeline -u jdbc:hive2://localhost:10000 -n admin -p admin -e "select count(i_f_1) from misha_table;" & done

it results in a large memory spike. With 20 queries I caused an OOM in an HS2 server running with
-Xmx200m, and with 50 queries in one running with -Xmx500m.

I am attaching the results of a jxray analysis of a heap dump that was generated
in the 50 queries/500m heap scenario. It suggests that there are several opportunities to reduce
memory pressure with fairly non-invasive changes to the code. One (duplicate strings) has already
been addressed separately. In this ticket, I am going to
address the fact that almost 20% of memory is used by instances of java.util.Properties. These
objects are heavily duplicated, since for each partition each concurrently running query creates
its own copy of Partition, PartitionDesc and Properties. Thus we have nearly 100,000 (50 queries
* 2,000 partitions) Properties in memory. By interning/deduplicating these objects we may
be able to save perhaps 15% of memory.
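
To illustrate what interning/deduplicating Properties objects could look like, here is a
minimal sketch. The class name PropertiesInterner and its intern() method are hypothetical
and are not taken from Hive. The sketch relies on java.util.Properties having content-based
equals() and hashCode(), and it assumes that callers never mutate an instance after it has
been interned.

```java
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper: hands out one shared instance per distinct Properties content. */
final class PropertiesInterner {
    // Keyed by content. Assumption: interned instances are treated as read-only,
    // otherwise their hash code changes and lookups stop working.
    private static final ConcurrentHashMap<Properties, Properties> CACHE =
            new ConcurrentHashMap<>();

    private PropertiesInterner() {}

    /** Returns the canonical instance equal to p, registering p if none exists yet. */
    static Properties intern(Properties p) {
        Properties previous = CACHE.putIfAbsent(p, p);
        return previous != null ? previous : p;
    }
}
```

With such a helper, 50 queries that each build an equal Properties object for the same
partition would end up sharing one instance instead of holding 50 copies. A real
implementation would also need some way to evict entries that are no longer used, which
this sketch omits.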

Note, however, that if there are queries that mutate partitions, the corresponding Properties
would be mutated as well. Thus we cannot simply use a single "canonicalized" Properties object
at all times for all Partition objects representing the same DB partition. Instead, I am going
to introduce a special CopyOnFirstWriteProperties class. Such an object initially holds an internal
reference to a canonicalized Properties object, and keeps doing so as long as only read methods are
called. However, once any mutating method is called, the given CopyOnFirstWriteProperties
copies the data from the canonicalized table into its own table, and uses that copy from then on.
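
The copy-on-first-write behaviour described above can be sketched roughly as follows. This
is only an illustration under stated assumptions, not the class proposed in this ticket: the
name CopyOnFirstWriteSketch and the isCopied() method are hypothetical, the sketch wraps a
Properties object rather than extending it, it exposes only a handful of methods, and it is
not thread-safe.

```java
import java.util.Properties;

/** Illustrative wrapper: reads go to a shared table until the first write. */
final class CopyOnFirstWriteSketch {
    private Properties shared; // canonicalized table, never mutated through this wrapper
    private Properties own;    // private copy, created lazily on the first mutating call

    CopyOnFirstWriteSketch(Properties canonical) {
        this.shared = canonical;
    }

    // Read methods: served from the shared table until a copy exists.
    String getProperty(String key) {
        return current().getProperty(key);
    }

    int size() {
        return current().size();
    }

    // Mutating methods: copy first, then modify only the private copy.
    Object setProperty(String key, String value) {
        copyIfNeeded();
        return own.setProperty(key, value);
    }

    Object remove(String key) {
        copyIfNeeded();
        return own.remove(key);
    }

    /** Hypothetical inspection method, used here only to observe the state. */
    boolean isCopied() {
        return own != null;
    }

    private Properties current() {
        return own != null ? own : shared;
    }

    private void copyIfNeeded() {
        if (own == null) {
            Properties copy = new Properties();
            copy.putAll(shared); // copies direct entries only, not any defaults
            own = copy;
            shared = null;       // drop the reference to the canonical table
        }
    }
}
```

Two wrappers created around the same canonical Properties share it while they are only read.
After one of them calls setProperty(), that wrapper has its own table and the other wrapper,
as well as the canonical object, remain unchanged.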
