hive-dev mailing list archives

From "Sergio Pena" <sergio.p...@cloudera.com>
Subject Re: Review Request 30281: Move parquet serialize implementation to DataWritableWriter to improve write speeds
Date Tue, 03 Feb 2015 15:58:44 GMT


> On Feb. 3, 2015, 2:02 a.m., Brock Noland wrote:
> > ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java, line 485
> > <https://reviews.apache.org/r/30281/diff/4/?file=840165#file840165line485>
> >
> >     Hey, sorry for being dumb, but it looks like many tests are being deleted as part of this change. Is that true, or are these duplicate tests, or are they being tested elsewhere?

Ah, sorry for not answering your last comment. The old negative tests do not apply to the new
change because we're not using Writable objects anymore. I added a couple of negative tests
that check the new errors raised by the class, such as testExpectedMapTypeOnRecord() and
testExpectedArrayTypeOnRecord().


- Sergio


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30281/#review70694
-----------------------------------------------------------


On Jan. 29, 2015, 5:12 p.m., Sergio Pena wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/30281/
> -----------------------------------------------------------
> 
> (Updated Jan. 29, 2015, 5:12 p.m.)
> 
> 
> Review request for hive, Ryan Blue, cheng xu, and Dong Chen.
> 
> 
> Bugs: HIVE-9333
>     https://issues.apache.org/jira/browse/HIVE-9333
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This patch moves the ParquetHiveSerDe.serialize() implementation to the DataWritableWriter class in order to save time in materializing data on serialize().
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java ea4109d358f7c48d1e2042e5da299475de4a0a29
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java 9caa4ed169ba92dbd863e4a2dc6d06ab226a4465
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java 060b1b722d32f3b2f88304a1a73eb249e150294b
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java 41b5f1c3b0ab43f734f8a211e3e03d5060c75434
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/ParquetRecordWriterWrapper.java e52c4bc0b869b3e60cb4bfa9e11a09a0d605ac28
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java a693aff18516d133abf0aae4847d3fe00b9f1c96
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestMapredParquetOutputFormat.java 667d3671547190d363107019cd9a2d105d26d336
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestParquetSerDe.java 007a665529857bcec612f638a157aa5043562a15
>   serde/src/java/org/apache/hadoop/hive/serde2/io/ParquetWritable.java PRE-CREATION
> 
> Diff: https://reviews.apache.org/r/30281/diff/
> 
> 
> Testing
> -------
> 
> The tests run were the following:
> 
> 1. JMH (Java microbenchmark)
> 
> This benchmark called the Parquet serialize/write methods using text writable objects.
> 
> Class.method                  Before Change (ops/s)      After Change (ops/s)       Change
> ----------------------------------------------------------------------------------------------------
> ParquetHiveSerDe.serialize:          19,113                   249,528               19x speed increase
> DataWritableWriter.write:             5,033                     5,201               3.34% speed increase
> 
> 
> 2. Write 20 million rows (~1GB file) from Text to Parquet
> 
> I wrote a ~1 GB file in Textfile format, then converted it to Parquet format using the following statement:
>
>     CREATE TABLE parquet STORED AS parquet AS SELECT * FROM text;
> 
> Time (s) it took to write the whole file BEFORE changes: 93.758 s
> Time (s) it took to write the whole file AFTER changes: 83.903 s
> 
> That is a speed increase of about 10%.
> 
> 
> Thanks,
> 
> Sergio Pena
> 
>
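
As a quick sanity check on the figures quoted above, the short Python sketch below recomputes the ratios from the reported numbers. The variable and function names are made up for illustration and are not part of the patch. Note that the reported serialize figures work out to roughly 13x rather than the 19x stated, so one of those values may have come from a different run; the sketch does not try to resolve that.

```python
# Figures copied from the Testing section of the review request.
serialize_before, serialize_after = 19_113, 249_528  # ops/s
write_before, write_after = 5_033, 5_201              # ops/s
ctas_before, ctas_after = 93.758, 83.903              # seconds for the CTAS run


def throughput_ratio(before, after):
    """How many times faster 'after' is than 'before' (ops/s figures)."""
    return after / before


def percent_change(before, after):
    """Relative change from 'before' to 'after', in percent."""
    return (after - before) / before * 100


serialize_ratio = throughput_ratio(serialize_before, serialize_after)
write_gain_pct = percent_change(write_before, write_after)

# Elapsed time went down, so negate the change to get "time saved".
ctas_time_saved_pct = -percent_change(ctas_before, ctas_after)
# Equivalent throughput gain: the same work finished in less time.
ctas_speedup_pct = (ctas_before / ctas_after - 1) * 100

print(f"serialize: {serialize_ratio:.2f}x")
print(f"write: +{write_gain_pct:.2f}%")
print(f"CTAS: {ctas_time_saved_pct:.2f}% less time, +{ctas_speedup_pct:.2f}% throughput")
```

The "10%" in the message matches the reduction in elapsed time; expressed as a throughput gain, the same measurement is a little higher.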

