hive-dev mailing list archives

From "cheng xu" <cheng.a...@intel.com>
Subject Re: Review Request 30281: Move parquet serialize implementation to DataWritableWriter to improve write speeds
Date Tue, 27 Jan 2015 01:59:33 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30281/#review69723
-----------------------------------------------------------


Thank you for your patch. I have several general questions, as follows.


ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java
<https://reviews.apache.org/r/30281/#comment114475>

    If compressionType is unneeded, this annotation may be removed as well.



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java
<https://reviews.apache.org/r/30281/#comment114474>

    Why remove compressionType code here?



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java
<https://reviews.apache.org/r/30281/#comment114487>

    Why not define writeGroupFields with a ParquetWritable parameter instead of passing in the object and objectInspector separately?
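
    For illustration only, a rough sketch of the suggested shape, assuming a hypothetical holder type that pairs the row object with its inspector (the names here are illustrative and not taken from the patch):

        import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;

        // Illustrative sketch: pair the row object with the inspector needed to
        // read it, so writeGroupFields() can take a single argument.
        public final class ParquetRecordSketch {
          private final Object record;
          private final ObjectInspector inspector;

          public ParquetRecordSketch(Object record, ObjectInspector inspector) {
            this.record = record;
            this.inspector = inspector;
          }

          public Object getRecord() { return record; }
          public ObjectInspector getInspector() { return inspector; }
        }

        // writeGroupFields(ParquetRecordSketch value) would then replace
        // writeGroupFields(Object value, StructObjectInspector inspector).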



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java
<https://reviews.apache.org/r/30281/#comment114488>

    If I understand correctly, when i % 2 equals 0 the element is a key, and the value is written only when the key is not null. What happens when both the key and the value are null? Could we follow the original approach of passing in the writable object and handling the null case inside the writeValue method? That would make the code simpler and easier to understand.
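
    For illustration only, a minimal sketch of the simpler control flow being suggested, with the null check living inside writeValue(); the method names and inspector usage here are assumptions, not the patch itself:

        import java.util.Map;

        import org.apache.hadoop.hive.serde2.objectinspector.MapObjectInspector;
        import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;

        // Illustrative sketch: writeValue() handles nulls itself, so the map loop
        // needs no index arithmetic (i % 2) or separate key/value null checks.
        final class MapWriteSketch {

          void writeMap(Object map, MapObjectInspector inspector) {
            Map<?, ?> entries = inspector.getMap(map);
            if (entries == null) {
              return;
            }
            for (Map.Entry<?, ?> entry : entries.entrySet()) {
              writeValue(entry.getKey(), inspector.getMapKeyObjectInspector());
              writeValue(entry.getValue(), inspector.getMapValueObjectInspector());
            }
          }

          void writeValue(Object value, ObjectInspector inspector) {
            if (value == null) {
              return; // a null key or null value is simply skipped here
            }
            // ... write the concrete value according to the inspector's category ...
          }
        }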


- cheng xu


On Jan. 27, 2015, 1:39 a.m., Sergio Pena wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/30281/
> -----------------------------------------------------------
> 
> (Updated Jan. 27, 2015, 1:39 a.m.)
> 
> 
> Review request for hive, Ryan Blue, cheng xu, and Dong Chen.
> 
> 
> Bugs: HIVE-9333
>     https://issues.apache.org/jira/browse/HIVE-9333
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This patch moves the ParquetHiveSerDe.serialize() implementation to the DataWritableWriter class in order to save time materializing data on serialize().
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java ea4109d358f7c48d1e2042e5da299475de4a0a29
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java 9caa4ed169ba92dbd863e4a2dc6d06ab226a4465
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java 060b1b722d32f3b2f88304a1a73eb249e150294b
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java 41b5f1c3b0ab43f734f8a211e3e03d5060c75434
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/ParquetRecordWriterWrapper.java e52c4bc0b869b3e60cb4bfa9e11a09a0d605ac28
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java a693aff18516d133abf0aae4847d3fe00b9f1c96
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestMapredParquetOutputFormat.java 667d3671547190d363107019cd9a2d105d26d336
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestParquetSerDe.java 007a665529857bcec612f638a157aa5043562a15
>   serde/src/java/org/apache/hadoop/hive/serde2/io/ParquetWritable.java PRE-CREATION
> 
> Diff: https://reviews.apache.org/r/30281/diff/
> 
> 
> Testing
> -------
> 
> The tests run were the following:
> 
> 1. JMH (Java microbenchmark)
> 
> This benchmark called parquet serialize/write methods using text writable objects. 
> 
> Class.method                  Before Change (ops/s)    After Change (ops/s)
> ---------------------------------------------------------------------------
> ParquetHiveSerDe.serialize:   19,113                   249,528  ->  19x speed increase
> DataWritableWriter.write:     5,033                    5,201    ->  3.34% speed increase
> 
> 
> 2. Write 20 million rows (~1GB file) from Text to Parquet
> 
> I wrote a ~1GB file in Textfile format, then converted it to Parquet format using the following statement: CREATE TABLE parquet STORED AS parquet AS SELECT * FROM text;
> 
> Time (s) it took to write the whole file BEFORE changes: 93.758 s
> Time (s) it took to write the whole file AFTER changes: 83.903 s
> 
> That is about a 10% speed increase.
> 
> 
> Thanks,
> 
> Sergio Pena
> 
>

