hive-dev mailing list archives

From "Sergio Pena" <sergio.p...@cloudera.com>
Subject Re: Review Request 30281: Move parquet serialize implementation to DataWritableWriter to improve write speeds
Date Thu, 29 Jan 2015 17:12:45 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30281/
-----------------------------------------------------------

(Updated Jan. 29, 2015, 5:12 p.m.)


Review request for hive, Ryan Blue, cheng xu, and Dong Chen.


Changes
-------

Patch with the changes Ferd recommended.
I am also checking the inspector category in writeValue() in order to pass the correct object
inspector to the rest of the methods. I think this makes the other methods cleaner.
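
The category check described above can be pictured with a minimal sketch. It is not the patch's code: Category and Inspector are simplified stand-ins written for this example (not Hive's ObjectInspector API), and only the primitive and list cases are handled. The point is that writeValue() inspects the category once and then hands each helper an inspector of the kind it expects.

```java
import java.util.List;

// Simplified stand-in for an inspector category; an assumption for this sketch.
enum Category { PRIMITIVE, LIST }

// Hypothetical minimal inspector: it only knows its category and, for lists,
// the inspector of its elements.
final class Inspector {
    final Category category;
    final Inspector element; // null for primitives

    Inspector(Category category, Inspector element) {
        this.category = category;
        this.element = element;
    }
}

final class CategoryDispatchSketch {
    // Checks the category once and routes to the matching helper.
    static String writeValue(Object value, Inspector inspector) {
        switch (inspector.category) {
            case PRIMITIVE:
                return writePrimitive(value);
            case LIST:
                return writeList((List<?>) value, inspector);
            default:
                throw new IllegalArgumentException("Unsupported category: " + inspector.category);
        }
    }

    // Needs no inspector lookup of its own; the caller already decided the kind.
    static String writePrimitive(Object value) {
        return String.valueOf(value);
    }

    // Receives the list inspector directly and recurses with the element inspector.
    static String writeList(List<?> values, Inspector listInspector) {
        StringBuilder out = new StringBuilder("[");
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) {
                out.append(',');
            }
            out.append(writeValue(values.get(i), listInspector.element));
        }
        return out.append(']').toString();
    }
}
```

With this shape the helpers do not need to cast or re-check the inspector themselves, which is one way to read the "cleaner" remark.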


Bugs: HIVE-9333
    https://issues.apache.org/jira/browse/HIVE-9333


Repository: hive-git


Description
-------

This patch moves the ParquetHiveSerDe.serialize() implementation to the DataWritableWriter class
in order to save the time spent materializing data in serialize().
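
As a rough illustration of the idea, the sketch below contrasts an eager serialize() that copies every field with a deferred one that only wraps the row and leaves the per-field work to the writer. It is written under assumptions: DeferredRow and the method names are hypothetical and are not taken from the patch, and the real classes work with Hive object inspectors rather than plain arrays.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper: carries the raw row without copying its fields.
final class DeferredRow {
    final Object[] rawFields;

    DeferredRow(Object[] rawFields) {
        this.rawFields = rawFields;
    }
}

final class DeferredSerializeSketch {
    // Eager variant: serialize() builds an intermediate copy of every field.
    static List<String> serializeEager(Object[] row) {
        List<String> materialized = new ArrayList<>(row.length);
        for (Object field : row) {
            materialized.add(String.valueOf(field));
        }
        return materialized;
    }

    // Deferred variant: serialize() only wraps the row; no per-field work here.
    static DeferredRow serializeDeferred(Object[] row) {
        return new DeferredRow(row);
    }

    // The writer reads fields straight from the wrapped row while emitting them.
    static String write(DeferredRow row) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < row.rawFields.length; i++) {
            if (i > 0) {
                out.append('|');
            }
            out.append(row.rawFields[i]);
        }
        return out.toString();
    }
}
```

In the deferred variant each field is visited once, at write time, instead of once in serialize() and again in the writer.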


Diffs (updated)
-----

  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java ea4109d358f7c48d1e2042e5da299475de4a0a29

  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java 9caa4ed169ba92dbd863e4a2dc6d06ab226a4465

  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java 060b1b722d32f3b2f88304a1a73eb249e150294b

  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java 41b5f1c3b0ab43f734f8a211e3e03d5060c75434

  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/ParquetRecordWriterWrapper.java e52c4bc0b869b3e60cb4bfa9e11a09a0d605ac28

  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java a693aff18516d133abf0aae4847d3fe00b9f1c96

  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestMapredParquetOutputFormat.java 667d3671547190d363107019cd9a2d105d26d336

  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestParquetSerDe.java 007a665529857bcec612f638a157aa5043562a15

  serde/src/java/org/apache/hadoop/hive/serde2/io/ParquetWritable.java PRE-CREATION 

Diff: https://reviews.apache.org/r/30281/diff/


Testing
-------

The tests run were the following:

1. JMH (Java microbenchmark)

This benchmark called parquet serialize/write methods using text writable objects. 

Class.method                  Before Change (ops/s)      After Change (ops/s)       
-------------------------------------------------------------------------------
ParquetHiveSerDe.serialize:          19,113                   249,528   ->  ~13x speed increase
DataWritableWriter.write:             5,033                     5,201   ->  3.34% speed increase


2. Write 20 million rows (~1GB file) from Text to Parquet

I wrote a ~1 GB file in TextFile format, then converted it to Parquet format using the following
statement: CREATE TABLE parquet STORED AS parquet AS SELECT * FROM text;

Time (s) it took to write the whole file BEFORE changes: 93.758 s
Time (s) it took to write the whole file AFTER changes: 83.903 s

That is a speed increase of about 10%.
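
For reference, the percentage can be derived from the two timings in two slightly different ways. The helper below is a small sketch; the class and method names are made up for this example.

```java
final class SpeedupSketch {
    // Percentage by which the elapsed time shrank.
    static double timeReductionPercent(double beforeSeconds, double afterSeconds) {
        return (beforeSeconds - afterSeconds) / beforeSeconds * 100.0;
    }

    // Percentage by which throughput grew for the same amount of work.
    static double throughputGainPercent(double beforeSeconds, double afterSeconds) {
        return (beforeSeconds / afterSeconds - 1.0) * 100.0;
    }
}
```

With the timings above, timeReductionPercent(93.758, 83.903) comes to roughly 10.5 and throughputGainPercent(93.758, 83.903) to roughly 11.7.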


Thanks,

Sergio Pena

