hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Rui Li <rui...@intel.com>
Subject Re: Review Request 56687: Intern strings in various critical places to reduce memory consumption.
Date Mon, 27 Feb 2017 09:48:18 GMT


> On Feb. 24, 2017, 7:38 a.m., Rui Li wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverMergeFiles.java, line
322
> > <https://reviews.apache.org/r/56687/diff/2/?file=1643011#file1643011line322>
> >
> >     will this cause the hash map to resize since the default load factor is 0.75?
and several similar concerns below
> 
> Misha Dmitriev wrote:
>     You are probably right, in that this constructor's parameter is the initial capacity
of this table (more or less the size of the internal array) - not how many elements the table
is expected to hold. However, if you check the code of HashMap, the things are more interesting.
The actual capacity of the table is always a power of two, so unless this parameter is also
a power of two, the capacity will be chosen as the nearest higher power of two, i.e. it will
be higher than the parameter and closer to what we actually need. Also, if we create a table
with the default size (16) here and then will put many more elements into it, it will be resized
several times, whereas with the current code it will be resized at most once. Trying to "factor
in" the load factor will likely add more confusion/complexity. All in all, given that choosing
capacity in HashMap internally is non-trivial, I think it's easier/safer to just call 'new
HashMap(oldMap.size())' as we do now.

Then could you explain why we need to change the current code? The JavaDoc of LinkedHashMap(Map<?
extends K, ? extends V> m) indicates it will create an instance "with a default load factor
(0.75) and an initial capacity sufficient to hold the mappings in the specified map". Looking
at the code, it computes the initial cap like "m.size()/loadFactor + 1", rounds it to next
power of two, and it avoids re-hashing. Won't that be good enough for us?


- Rui


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56687/#review166649
-----------------------------------------------------------


On Feb. 24, 2017, 9:27 p.m., Misha Dmitriev wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56687/
> -----------------------------------------------------------
> 
> (Updated Feb. 24, 2017, 9:27 p.m.)
> 
> 
> Review request for hive, Chaoyu Tang, Mohit Sabharwal, and Sergio Pena.
> 
> 
> Bugs: https://issues.apache.org/jira/browse/HIVE-15882
>     https://issues.apache.org/jira/browse/https://issues.apache.org/jira/browse/HIVE-15882
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> See the description of the problem in https://issues.apache.org/jira/browse/HIVE-15882
Interning strings per this review removes most of the overhead due to duplicate strings.
> 
> Also, where maps in several places are created from other maps, use the original map's
size for the new map. This is to avoid the situation when a map with default capacity (typically
16) is created to hold just 2-3 entries, and the rest of the internal 16-entry array is wasted.
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/common/StringInternUtils.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java e81cbce3e333d44a4088c10491f399e92a505293

>   ql/src/java/org/apache/hadoop/hive/ql/hooks/Entity.java 08420664d59f28f75872c25c9f8ee42577b23451

>   ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java e91064b9c75e8adb2b36f21ff19ec0c1539b03b9

>   ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java 51530ac16c92cc75d501bfcb573557754ba0c964

>   ql/src/java/org/apache/hadoop/hive/ql/io/SymbolicInputFormat.java 55b3b551a1dac92583b6e03b10beb8172ca93d45

>   ql/src/java/org/apache/hadoop/hive/ql/lockmgr/HiveLockObject.java 82dc89803be9cf9e0018720eeceb90ff450bfdc8

>   ql/src/java/org/apache/hadoop/hive/ql/metadata/Partition.java c0edde9e92314d86482b5c46178987e79fae57fe

>   ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java c6ae6f290857cfd10f1023058ede99bf4a10f057

>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 24d16812515bdfa90b4be7a295c0388fcdfe95ef

>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/GenMRSkewJoinProcessor.java
ede4fcbe342052ad86dadebcc49da2c0f515ea98 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/NullScanTaskDispatcher.java
0882ae2c6205b1636cbc92e76ef66bb70faadc76 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverMergeFiles.java 68b0ad9ea63f051f16fec3652d8525f7ab07eb3f

>   ql/src/java/org/apache/hadoop/hive/ql/plan/MapWork.java d4bdd96eaf8d179bed43b8a8c3be0d338940154a

>   ql/src/java/org/apache/hadoop/hive/ql/plan/MsckDesc.java b7a7e4b7a5f8941b080c7805d224d3885885f444

>   ql/src/java/org/apache/hadoop/hive/ql/plan/PartitionDesc.java 73981e826870139a42ad881103fdb0a2ef8433a2

> 
> Diff: https://reviews.apache.org/r/56687/diff/
> 
> 
> Testing
> -------
> 
> I've measured how much memory this change plus another one (interning Properties in PartitionDesc)
save in my HS2 benchmark - the result is 37%. See the details in HIVE-15882.
> 
> 
> Thanks,
> 
> Misha Dmitriev
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message