From Saket Saurabh <ssaur...@apache.org>
Subject Re: Review Request 50934: HIVE-14233 Improve vectorization for ACID by eliminating row-by-row stitching
Date Thu, 11 Aug 2016 23:36:30 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50934/
-----------------------------------------------------------

(Updated Aug. 11, 2016, 4:36 p.m.)


Review request for hive and Eugene Koifman.


Repository: hive-git


Description
-------

https://issues.apache.org/jira/browse/HIVE-14233
This JIRA proposes to improve vectorization for ACID by eliminating row-by-row stitching when
reading back ACID files. In the current implementation, a vectorized row batch is built
by populating it one row at a time before it is passed up the operator pipeline. Row-by-row
stitching was needed because the ACID insert/update/delete events from the various delta files
had to be merged before the current version of a given row could be determined. HIVE-14035
removed that limitation by splitting each ACID update event into a delete plus an insert. It
also made it possible to create splits on delta files.
Building on HIVE-14035, this JIRA proposes to remove the bottleneck in the vectorized code
path for ACID by reading row batches directly from the underlying ORC files and avoiding
stitching altogether. Once a row batch is read from the split (which may be on a base or
delta file), deleted rows are identified by cross-referencing the batch against a data
structure that tracks only delete events (found in the delete_delta files).
This is expected to yield a substantial performance gain when reading ACID files in
vectorized fashion, and to enable further optimizations on top of it.
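
The cross-referencing step described above can be illustrated with a minimal, standalone
sketch. It is not the code in this patch: the class DeleteFilterSketch, the RowKey type, and
the selectSurvivors method are hypothetical names, and a plain HashSet stands in for whatever
structure the patch actually uses to track delete events. The sketch assumes a row is
identified by (original transaction, bucket, row id) and that the result is a list of
surviving row indices, in the spirit of a selection vector over the batch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class DeleteFilterSketch {

    /** Hypothetical row identity: (original transaction, bucket, row id). */
    static final class RowKey {
        final long originalTxn;
        final int bucket;
        final long rowId;

        RowKey(long originalTxn, int bucket, long rowId) {
            this.originalTxn = originalTxn;
            this.bucket = bucket;
            this.rowId = rowId;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof RowKey)) return false;
            RowKey k = (RowKey) o;
            return originalTxn == k.originalTxn && bucket == k.bucket && rowId == k.rowId;
        }

        @Override
        public int hashCode() {
            return Objects.hash(originalTxn, bucket, rowId);
        }
    }

    /**
     * Returns the indices of the rows in a batch that have no matching delete event.
     * The three arrays are the per-row key columns of the batch; only the first
     * 'size' entries are examined. No rows are copied or stitched one by one.
     */
    static int[] selectSurvivors(long[] txns, int[] buckets, long[] rowIds,
                                 int size, Set<RowKey> deleted) {
        int[] selected = new int[size];
        int n = 0;
        for (int i = 0; i < size; i++) {
            if (!deleted.contains(new RowKey(txns[i], buckets[i], rowIds[i]))) {
                selected[n++] = i;
            }
        }
        return Arrays.copyOf(selected, n);
    }

    public static void main(String[] args) {
        // Delete events gathered up front (assumed to come from delete delta files).
        Set<RowKey> deleted = new HashSet<>();
        deleted.add(new RowKey(5L, 0, 1L));

        // A batch of three rows read directly from a base or delta file.
        long[] txns = {5L, 5L, 5L};
        int[] buckets = {0, 0, 0};
        long[] rowIds = {0L, 1L, 2L};

        int[] survivors = selectSurvivors(txns, buckets, rowIds, 3, deleted);
        System.out.println(Arrays.toString(survivors)); // prints [0, 2]
    }
}
```

The point of the sketch is that the batch itself is left untouched; only a list of surviving
indices is computed, so the cost per batch is one lookup per row rather than a merge across
all delta files.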


Diffs (updated)
-----

  ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java 334cb31c5406f500c122a11eccef25b92d357cd4

  ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java e46ca51eff9c230147166e9428d7f462d2f9e772

  ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java PRE-CREATION

  ql/src/test/queries/clientpositive/acid_vectorization.q 832909bdb1bc79e01163373beed03eaaffcefd3d

  ql/src/test/results/clientpositive/acid_vectorization.q.out 1792979156ec361c85882ac8b6968e93d42b5f31


Diff: https://reviews.apache.org/r/50934/diff/


Testing
-------


Thanks,

Saket Saurabh

