hive-issues mailing list archives

From "Zheng Shao (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-11188) Make ORCFile's String Dictionary more efficient
Date Tue, 07 Jul 2015 07:34:04 GMT

     [ https://issues.apache.org/jira/browse/HIVE-11188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-11188:
------------------------------
    Description: 
Currently, ORCFile's String Dictionary uses StringRedBlackTree for adding/finding/sorting
duplicate strings.  When there are a large number of unique strings (say over 16K) and a large
number of rows (say 1M), the per-row tree lookups add up to O(1M * log(16K)) time, which can
be very long.

Alternatively, ORCFile's String Dictionary could use a HashMap for adding/finding duplicate
strings, and run a quicksort at the end to produce the sorted order.  In the same case as
above, the total time drops to O(1M + 16K * log(16K)), which is much faster.

When the number of unique strings is close to the number of rows (say both around 1M), ORC
automatically disables dictionary encoding.  Here the old approach takes O(1M * log(1M)),
while the new approach takes only O(1M), since the final quicksort can be skipped once
dictionary encoding is disabled.

So in either case, the new approach should be a win.
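To make the proposal concrete, here is a hedged Java sketch of a HashMap-based string
dictionary (the class and method names are illustrative, not Hive's actual
StringRedBlackTree/WriterImpl API): add() is O(1) amortized per row, and a single sort at
stripe-flush time costs O(K * log(K)) over the K unique strings.

```java
import java.util.*;

// Illustrative sketch only, not the actual ORC writer code: deduplicate with a
// HashMap while rows stream in, and defer the sort until the dictionary is flushed.
public class HashStringDictionary {
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> byFirstUse = new ArrayList<>();

    // Add one row's value; returns its (insertion-order) dictionary id. O(1) amortized.
    public int add(String value) {
        Integer id = ids.get(value);
        if (id == null) {
            id = byFirstUse.size();
            ids.put(value, id);
            byFirstUse.add(value);
        }
        return id;
    }

    public int size() { return byFirstUse.size(); }

    // At flush time: sort the K unique strings once, O(K log K), and return a
    // remap table from insertion-order id to sorted dictionary position.
    public int[] sortAndRemap() {
        String[] sorted = byFirstUse.toArray(new String[0]);
        Arrays.sort(sorted);
        int[] remap = new int[sorted.length];
        for (int pos = 0; pos < sorted.length; pos++) {
            remap[ids.get(sorted[pos])] = pos;
        }
        return remap;
    }

    public static void main(String[] args) {
        HashStringDictionary dict = new HashStringDictionary();
        dict.add("banana");          // insertion id 0
        dict.add("apple");           // insertion id 1
        dict.add("banana");          // duplicate, id 0 again
        // "apple" sorts before "banana", so id 0 -> position 1, id 1 -> position 0.
        System.out.println(Arrays.toString(dict.sortAndRemap()));  // prints [1, 0]
    }
}
```

If dictionary encoding gets disabled (too many unique values), sortAndRemap() is simply never
called, which is where the O(1M) bound in the dense-unique case comes from.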


Here is PMP (poor man's profiler) output based on ~600 stack traces (so a count of 126 means
126/600 ~= 21% of total time).  The workload is a query like "INSERT OVERWRITE TABLE target
SELECT * FROM src" on hive-1.1.0-cdh-5.4.1, where the target table is STORED AS ORC and the
src table is STORED AS RCFILE.

    126  org.apache.hadoop.hive.ql.io.orc.StringRedBlackTree.compareValue(StringRedBlackTree.java:67)
     35  java.util.zip.Deflater.deflateBytes(Native Method)
     26  org.apache.hadoop.hive.ql.io.orc.SerializationUtils.findClosestNumBits(SerializationUtils.java:218)
     24  org.apache.hadoop.hive.serde2.lazy.LazyNonPrimitive.isNull(LazyNonPrimitive.java:63)
     22  org.apache.hadoop.hive.serde2.lazy.LazyMap.parse(LazyMap.java:204)
     22  org.apache.hadoop.hive.serde2.lazy.LazyLong.parseLong(LazyLong.java:116)
     21  org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField(ColumnarStructBase.java:111)
     19  org.apache.hadoop.hive.serde2.lazy.LazyPrimitive.hashCode(LazyPrimitive.java:57)
     18  org.apache.hadoop.hive.ql.io.orc.RedBlackTree.getRight(RedBlackTree.java:99)
     16  org.apache.hadoop.hive.ql.io.RCFile$Reader.getCurrentRow(RCFile.java:1932)
     15  org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
     15  org.apache.hadoop.hive.ql.io.orc.WriterImpl$IntegerTreeWriter.write(WriterImpl.java:929)
     12  org.apache.hadoop.hive.ql.io.orc.WriterImpl$StructTreeWriter.write(WriterImpl.java:1607)
     12  org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
     11  org.apache.hadoop.hive.ql.io.orc.RedBlackTree.getLeft(RedBlackTree.java:92)
     11  org.apache.hadoop.hive.ql.io.orc.DynamicIntArray.add(DynamicIntArray.java:105)
     10  org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
     ...
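For reference, "poor man's profiling" just means taking many stack snapshots and counting
identical frames.  The numbers above were gathered externally (e.g. repeated jstack dumps);
the following is a hedged in-process Java analogue using Thread.getAllStackTraces(), with
illustrative sample counts and intervals:

```java
import java.util.*;

// Sketch of an in-process frame-counting sampler, analogous to PMP.
public class PoorMansProfiler {
    // Take `samples` snapshots of all live threads, `intervalMs` apart,
    // and count how often each stack frame appears.
    public static Map<String, Integer> sample(int samples, long intervalMs)
            throws InterruptedException {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            for (StackTraceElement[] stack : Thread.getAllStackTraces().values()) {
                for (StackTraceElement frame : stack) {
                    counts.merge(frame.toString(), 1, Integer::sum);
                }
            }
            Thread.sleep(intervalMs);
        }
        return counts;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, Integer> counts = sample(20, 5);
        // Print the hottest frames in the same shape as the list above: count, frame.
        counts.entrySet().stream()
              .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
              .limit(10)
              .forEach(e -> System.out.printf("%7d  %s%n", e.getValue(), e.getKey()));
    }
}
```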


> Make ORCFile's String Dictionary more efficient
> -----------------------------------------------
>
>                 Key: HIVE-11188
>                 URL: https://issues.apache.org/jira/browse/HIVE-11188
>             Project: Hive
>          Issue Type: Improvement
>          Components: File Formats
>    Affects Versions: 1.2.0, 1.1.0
>            Reporter: Zheng Shao
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
