hive-dev mailing list archives

From "Sergio Pena" <>
Subject Re: Review Request 30281: Move parquet serialize implementation to DataWritableWriter to improve write speeds
Date Tue, 27 Jan 2015 01:39:07 GMT

This is an automatically generated e-mail. To reply, visit:

(Updated Jan. 27, 2015, 1:39 a.m.)

Review request for hive, Ryan Blue, cheng xu, and Dong Chen.


I forgot to add the BYTE and DECIMAL implementations. This patch contains them.

Bugs: HIVE-9333

Repository: hive-git


This patch moves the ParquetHiveSerDe.serialize() implementation to the DataWritableWriter
class, so that serialize() no longer spends time materializing data.
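
As a rough illustration of the idea (not the code in this patch), the sketch below assumes that the cost being avoided is building an intermediate copy of each row inside serialize(). All names here (RowInspector, DeferredRecord, DeferredSerializeSketch) are hypothetical and stand in for the Hive classes; the "after" path only wraps the row and its inspector, and the writer reads the fields when it actually writes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an object inspector: it knows how to read fields from a raw row.
interface RowInspector {
    int fieldCount();
    Object field(Object row, int index);
}

// Hypothetical lightweight record: it holds references only and copies nothing.
final class DeferredRecord {
    final Object row;
    final RowInspector inspector;

    DeferredRecord(Object row, RowInspector inspector) {
        this.row = row;
        this.inspector = inspector;
    }
}

final class DeferredSerializeSketch {
    // "Before" shape (assumed): serialize() walks the row and copies every field.
    static List<Object> serializeEager(Object row, RowInspector inspector) {
        List<Object> copy = new ArrayList<>(inspector.fieldCount());
        for (int i = 0; i < inspector.fieldCount(); i++) {
            copy.add(inspector.field(row, i));
        }
        return copy;
    }

    // "After" shape (assumed): serialize() only wraps the row and its inspector.
    static DeferredRecord serializeDeferred(Object row, RowInspector inspector) {
        return new DeferredRecord(row, inspector);
    }

    // The writer reads each field through the inspector at write time.
    static String write(DeferredRecord record) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < record.inspector.fieldCount(); i++) {
            if (i > 0) {
                out.append(',');
            }
            out.append(record.inspector.field(record.row, i));
        }
        return out.toString();
    }

    // Small demo row with three fields backed by an Object[].
    static String demo() {
        Object[] row = {1, "a", 2.5};
        RowInspector inspector = new RowInspector() {
            public int fieldCount() { return 3; }
            public Object field(Object r, int i) { return ((Object[]) r)[i]; }
        };
        return write(serializeDeferred(row, inspector));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the per-row work in serialize() drops to a single small allocation, while the total work in write() stays about the same.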

Diffs (updated)

  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ ea4109d358f7c48d1e2042e5da299475de4a0a29

  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ 9caa4ed169ba92dbd863e4a2dc6d06ab226a4465

  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/ 060b1b722d32f3b2f88304a1a73eb249e150294b

  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/ 41b5f1c3b0ab43f734f8a211e3e03d5060c75434

  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/ e52c4bc0b869b3e60cb4bfa9e11a09a0d605ac28

  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/ a693aff18516d133abf0aae4847d3fe00b9f1c96

  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/ 667d3671547190d363107019cd9a2d105d26d336

  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/ 007a665529857bcec612f638a157aa5043562a15

  serde/src/java/org/apache/hadoop/hive/serde2/io/ PRE-CREATION 



The following tests were run:

1. JMH (Java microbenchmark)

This benchmark calls the Parquet serialize and write methods using text writable objects.

Class.method                  Before Change (ops/s)      After Change (ops/s)       
ParquetHiveSerDe.serialize:          19,113                   249,528   ->  ~13x speed increase
DataWritableWriter.write:             5,033                     5,201   ->  3.34% speed increase

2. Write 20 million rows (~1GB file) from Text to Parquet

I wrote a ~1GB file in Textfile format, then converted it to Parquet format using the following
statement: CREATE TABLE parquet STORED AS parquet AS SELECT * FROM text;

Time (s) it took to write the whole file BEFORE changes: 93.758 s
Time (s) it took to write the whole file AFTER changes: 83.903 s

That is roughly a 10% speed increase.
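
For anyone who wants to recompute the ratios from the raw figures reported above, here is a small helper. The class and method names are made up for illustration. Dividing the ops/s figures gives roughly 13x for serialize and about 3.3% for write, and the CTAS timings work out to about 10.5% less wall-clock time.

```java
final class BenchmarkMath {
    // Throughput ratio: how many times more ops/s after the change.
    static double speedup(double beforeOps, double afterOps) {
        return afterOps / beforeOps;
    }

    // Relative throughput gain in percent.
    static double percentGain(double beforeOps, double afterOps) {
        return (afterOps - beforeOps) / beforeOps * 100.0;
    }

    // Relative reduction in elapsed time in percent.
    static double percentTimeSaved(double beforeSec, double afterSec) {
        return (beforeSec - afterSec) / beforeSec * 100.0;
    }

    public static void main(String[] args) {
        // Figures copied from the JMH table and the 20-million-row test.
        System.out.printf("serialize speedup: %.2fx%n", speedup(19113, 249528));
        System.out.printf("write gain: %.2f%%%n", percentGain(5033, 5201));
        System.out.printf("CTAS time saved: %.2f%%%n", percentTimeSaved(93.758, 83.903));
    }
}
```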


Sergio Pena
