hive-dev mailing list archives

From "Sergio Pena" <sergio.p...@cloudera.com>
Subject Re: Review Request 30281: Move parquet serialize implementation to DataWritableWriter to improve write speeds
Date Tue, 27 Jan 2015 16:32:18 GMT


> On Jan. 27, 2015, 1:59 a.m., cheng xu wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java, line 130
> > <https://reviews.apache.org/r/30281/diff/1/?file=834396#file834396line130>
> >
> >     Why remove compressionType code here?

I removed the code because it was unused. The compressionType variable is private and is
not referenced anywhere else in the code. It may have been part of a change a developer
intended to make but never finished.


> On Jan. 27, 2015, 1:59 a.m., cheng xu wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java, line 70
> > <https://reviews.apache.org/r/30281/diff/1/?file=834398#file834398line70>
> >
> >     Why not define writeGroupFields with a parameter of ParquetWritable instead of
> >     passing in the object and objectInspector separately?

I did that at the beginning, but I had to create a new ParquetWritable() object every time
I called the writeGroup() method. I wanted to avoid that memory allocation, since writeGroup()
is called many times when the schema contains a STRUCT data type.
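The allocation trade-off described above can be sketched with a toy example. None of these classes are Hive's real ones (Wrapper stands in for a ParquetWritable-style holder, and the plain String for an ObjectInspector); the point is only that wrapping the pair on every recursive call produces one short-lived object per nesting level, while passing the two arguments directly produces none.

```java
// Toy sketch, NOT Hive's actual classes: why passing (value, inspector) as two
// arguments beats wrapping them in a new object on every recursive call.
final class Wrapper {            // hypothetical stand-in for ParquetWritable
    static int allocations = 0;
    final Object value;
    final String inspector;
    Wrapper(Object value, String inspector) {
        allocations++;           // each nested STRUCT level would pay this cost
        this.value = value;
        this.inspector = inspector;
    }
}

public class WriteStyles {
    // Style 1: wrapper per call -> one short-lived allocation per nesting level.
    static void writeWrapped(Wrapper w, int depth) {
        if (depth == 0) return;
        writeWrapped(new Wrapper(w.value, w.inspector), depth - 1);
    }

    // Style 2: pass the pair directly -> zero extra allocations.
    static void writeDirect(Object value, String inspector, int depth) {
        if (depth == 0) return;
        writeDirect(value, inspector, depth - 1);
    }

    public static void main(String[] args) {
        writeWrapped(new Wrapper("row", "structOI"), 1000);
        int wrapped = Wrapper.allocations;
        Wrapper.allocations = 0;
        writeDirect("row", "structOI", 1000);
        System.out.println(wrapped + " allocations vs " + Wrapper.allocations);
    }
}
```

For deeply nested STRUCT columns this saves one allocation per level per row, which is exactly the kind of short-lived garbage that shows up in a write-heavy benchmark.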


> On Jan. 27, 2015, 1:59 a.m., cheng xu wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java, lines 225-229
> > <https://reviews.apache.org/r/30281/diff/1/?file=834398#file834398line225>
> >
> >     Assume that when i%2 equals 0 it is a key, and we write the value only when the
> >     key's value is not null. What happens when both the key and the value are null?
> >     Could we use the original approach of passing in the writable object and handling
> >     the null case in the writeValue method? The code would be simpler and easier to
> >     understand.

Thanks. I made the necessary changes to move the startField/endField calls into other methods
in order to make the code clearer and more readable.
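The shape this refactor converges on can be sketched as follows. The class and method names here are illustrative, not the actual DataWritableWriter API: the idea is that writeValue owns the null check, so a map entry whose key and value are both null simply emits nothing, and callers never have to bracket a possibly-null value with startField/endField themselves.

```java
// Illustrative sketch (hypothetical names, not Hive's DataWritableWriter API):
// writeValue owns the null check, so callers stay simple.
import java.util.ArrayList;
import java.util.List;

public class FieldWriter {
    final List<String> events = new ArrayList<>();   // records emitted calls

    void startField(String name) { events.add("start:" + name); }
    void endField(String name)   { events.add("end:" + name); }

    // A null value emits no field at all, which is how Parquet represents
    // an absent optional value.
    void writeValue(String name, Object value) {
        if (value == null) {
            return;                       // nothing written for nulls
        }
        startField(name);
        events.add("value:" + value);
        endField(name);
    }

    // A map entry can then write key and value uniformly, even when both are null.
    void writeMapEntry(Object key, Object value) {
        writeValue("key", key);
        writeValue("value", value);
    }
}
```

With this shape, the i%2 bookkeeping the reviewer questioned disappears: every position goes through the same writeValue path.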


- Sergio


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30281/#review69723
-----------------------------------------------------------


On Jan. 27, 2015, 1:39 a.m., Sergio Pena wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/30281/
> -----------------------------------------------------------
> 
> (Updated Jan. 27, 2015, 1:39 a.m.)
> 
> 
> Review request for hive, Ryan Blue, cheng xu, and Dong Chen.
> 
> 
> Bugs: HIVE-9333
>     https://issues.apache.org/jira/browse/HIVE-9333
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This patch moves the ParquetHiveSerDe.serialize() implementation to the DataWritableWriter class in order to save time in materializing data on serialize().
> 
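The deferred-materialization idea in the description above can be sketched with a toy example. These names are hypothetical, not Hive's real classes: serialize() no longer walks the row to build an intermediate Writable tree, it just returns a cheap holder, and the single field walk happens later inside the writer.

```java
// Hypothetical sketch of deferred serialization: serialize() returns a cheap
// holder; the expensive field walk happens once, in the writer.
public class DeferredSerializeSketch {
    // Stand-in for a ParquetWritable-style holder carrying row + inspector.
    static final class RowHolder {
        final Object row;
        final String inspector;
        RowHolder(Object row, String inspector) {
            this.row = row;
            this.inspector = inspector;
        }
    }

    // Stand-in for ParquetHiveSerDe.serialize(): O(1), no materialization.
    static RowHolder serialize(Object row, String inspector) {
        return new RowHolder(row, inspector);   // just wrap, do not copy fields
    }

    // Stand-in for DataWritableWriter.write(): the one place that walks fields.
    static int write(RowHolder holder) {
        Object[] fields = (Object[]) holder.row;
        int written = 0;
        for (Object f : fields) {
            if (f != null) written++;           // a real writer emits to Parquet here
        }
        return written;
    }

    public static void main(String[] args) {
        RowHolder h = serialize(new Object[] {"a", null, 3}, "structOI");
        System.out.println(write(h));   // prints 2
    }
}
```

Doing the walk once in the writer, instead of once in serialize() and again in write(), is where the serialize() speedup in the benchmarks below comes from.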
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java ea4109d358f7c48d1e2042e5da299475de4a0a29 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java 9caa4ed169ba92dbd863e4a2dc6d06ab226a4465 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java 060b1b722d32f3b2f88304a1a73eb249e150294b 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java 41b5f1c3b0ab43f734f8a211e3e03d5060c75434 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/ParquetRecordWriterWrapper.java e52c4bc0b869b3e60cb4bfa9e11a09a0d605ac28 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java a693aff18516d133abf0aae4847d3fe00b9f1c96 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestMapredParquetOutputFormat.java 667d3671547190d363107019cd9a2d105d26d336 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestParquetSerDe.java 007a665529857bcec612f638a157aa5043562a15 
>   serde/src/java/org/apache/hadoop/hive/serde2/io/ParquetWritable.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/30281/diff/
> 
> 
> Testing
> -------
> 
> The tests run were the following:
> 
> 1. JMH (Java microbenchmark)
> 
> This benchmark called parquet serialize/write methods using text writable objects. 
> 
> Class.method                  Before Change (ops/s)      After Change (ops/s)
> -------------------------------------------------------------------------------
> ParquetHiveSerDe.serialize:          19,113                   249,528   ->  ~13x speed increase
> DataWritableWriter.write:             5,033                     5,201   ->  3.34% speed increase
> 
> 
> 2. Write 20 million rows (~1GB file) from Text to Parquet
> 
> I wrote a ~1GB file in Textfile format, then converted it to Parquet format using the following
> statement: CREATE TABLE parquet STORED AS parquet AS SELECT * FROM text;
> 
> Time (s) it took to write the whole file BEFORE changes: 93.758 s
> Time (s) it took to write the whole file AFTER changes: 83.903 s
> 
> That is roughly a 10% speed increase.
> 
> 
> Thanks,
> 
> Sergio Pena
> 
>

