impala-reviews mailing list archives

From "Pooja Nilangekar (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] Populate OffsetIndex and ColumnIndex of a row group and Filter pages
Date Tue, 25 Jul 2017 01:39:53 GMT
Pooja Nilangekar has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/7465

Change subject: Populate OffsetIndex and ColumnIndex of a row_group and Filter pages
......................................................................

Populate OffsetIndex and ColumnIndex of a row_group and Filter pages

The statistics for each page in a ColumnChunk of a RowGroup are
added to the ColumnIndex structure. When a page is flushed to the
file, its location and the offset of its first row are added to the
PageLocation structure of the OffsetIndex. If a file is found to
have only one row_group when it is finalized, the ColumnIndex for
each column is written to the file (just before the footer) and its
length and offset are populated in the ColumnChunk. The OffsetIndexes
of all the columns in the row_group are collected in the
RowGroupOffsetIndex structure, which is also written out to the
file; its offset and length are recorded in the RowGroup. This
ensures that range scans and point lookups can skip pages based on
these statistics, while scans without selective predicates do not
incur any overhead. Space efficiency is ensured by not populating
parquet::Statistics in the ColumnMeta when the statistics are written
to the ColumnIndex. Additionally, for ordered columns, the
ColumnIndex only contains the min_values.
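
The bookkeeping described above can be sketched roughly as follows.
This is an illustrative Python model, not the code in
hdfs-parquet-table-writer.cc: the class ColumnChunkIndexBuilder, its
flush_page() method, and the field names are hypothetical and chosen
only to mirror the description (per-page min/max for the ColumnIndex,
per-page location and first-row offset for the OffsetIndex).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PageLocation:
    # Hypothetical field names modelled on the description above.
    offset: int           # byte offset of the page within the file
    size: int             # size of the page in bytes
    first_row_index: int  # index of the page's first row in the row group


@dataclass
class ColumnChunkIndexBuilder:
    """Accumulates per-page index entries for one column chunk (sketch)."""
    min_values: List[int] = field(default_factory=list)
    max_values: List[int] = field(default_factory=list)
    page_locations: List[PageLocation] = field(default_factory=list)
    rows_seen: int = 0

    def flush_page(self, values: List[int], file_offset: int, size: int) -> None:
        # ColumnIndex side: one min/max entry per page.
        self.min_values.append(min(values))
        self.max_values.append(max(values))
        # OffsetIndex side: where the page lives and which row it starts at.
        self.page_locations.append(
            PageLocation(file_offset, size, self.rows_seen))
        self.rows_seen += len(values)


builder = ColumnChunkIndexBuilder()
builder.flush_page([3, 1, 2], file_offset=100, size=40)
builder.flush_page([7, 5], file_offset=140, size=30)
```

After the two flushes, the builder holds one min/max pair and one
PageLocation per page, and the second page's first_row_index equals
the number of rows in the first page.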

While scanning a RowGroup, the HdfsParquetScanner invokes the
ParquetIndexFilter for the RowGroups where the indexes are present.
The filter evaluates each conjunct against each page of the
corresponding column. It consolidates the RowRanges for the given
RowGroup and returns the final set of pages to be scanned for each
column.
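
As a rough illustration of the filtering step, the sketch below
assumes each conjunct can be reduced to a closed value range
[lo, hi] on a single column and that row ranges are half-open
[start, end) intervals. The function names are hypothetical and do
not correspond to the ParquetIndexFilter interface; the sketch only
shows the idea of selecting pages by min/max, intersecting the
resulting row ranges across columns, and mapping them back to pages.

```python
from typing import List, Tuple

RowRange = Tuple[int, int]  # half-open [first_row, end_row)


def page_row_ranges(first_rows: List[int], num_rows: int) -> List[RowRange]:
    # Each page ends where the next one starts; the last ends at num_rows.
    return list(zip(first_rows, first_rows[1:] + [num_rows]))


def matching_ranges(mins, maxs, first_rows, num_rows, lo, hi) -> List[RowRange]:
    """Row ranges of pages whose [min, max] overlaps the range [lo, hi]."""
    out: List[RowRange] = []
    pages = page_row_ranges(first_rows, num_rows)
    for (start, end), mn, mx in zip(pages, mins, maxs):
        if mx >= lo and mn <= hi:
            if out and out[-1][1] == start:
                out[-1] = (out[-1][0], end)  # merge with adjacent range
            else:
                out.append((start, end))
    return out


def intersect(a: List[RowRange], b: List[RowRange]) -> List[RowRange]:
    """Intersection of two sorted, non-overlapping lists of row ranges."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start, end = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def pages_to_scan(first_rows, num_rows, ranges) -> List[int]:
    """Indexes of pages that overlap at least one surviving row range."""
    return [idx
            for idx, (start, end) in enumerate(page_row_ranges(first_rows, num_rows))
            if any(start < r_end and r_start < end for r_start, r_end in ranges)]


# Column a: three pages of 100 rows; conjunct "a BETWEEN 60 AND 100".
a_first = [0, 100, 200]
a_ranges = matching_ranges([1, 50, 90], [40, 80, 120], a_first, 300, 60, 100)
# Column b: two pages of 150 rows; conjunct "b BETWEEN 0 AND 450".
b_first = [0, 150]
b_ranges = matching_ranges([5, 500], [400, 900], b_first, 300, 0, 450)

final_ranges = intersect(a_ranges, b_ranges)
a_pages = pages_to_scan(a_first, 300, final_ranges)
b_pages = pages_to_scan(b_first, 300, final_ranges)
```

Because pages of different columns need not align on row boundaries,
the consolidation is done on row ranges rather than page indexes, and
each column then reads only the pages that overlap a surviving range.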

Testing: The populated index structures were deserialized from the
Parquet file and the validity of the offsets and statistics was
verified.
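
A check of this kind might look like the following sketch. It is not
taken from tests/util/get_parquet_index.py; validate_index() and the
tuple layout (offset, size, first_row_index) are assumptions made for
illustration, and the specific invariants checked are only the
obvious ones implied by the description.

```python
def validate_index(page_locations, mins, maxs, num_rows):
    """Return a list of problems found in one column's index (sketch).

    page_locations is a list of (offset, size, first_row_index) tuples.
    """
    if not (len(page_locations) == len(mins) == len(maxs)):
        return ["page count mismatch between offset and column index"]
    problems = []
    prev_end, prev_row = None, -1
    for i, ((offset, size, first_row), mn, mx) in enumerate(
            zip(page_locations, mins, maxs)):
        if mn > mx:
            problems.append(f"page {i}: min > max")
        if prev_end is not None and offset < prev_end:
            problems.append(f"page {i}: overlaps previous page")
        if first_row <= prev_row or first_row >= num_rows:
            problems.append(f"page {i}: bad first row index")
        prev_end, prev_row = offset + size, first_row
    return problems


good = validate_index([(100, 40, 0), (140, 30, 3)], [1, 5], [3, 7], 5)
bad = validate_index([(100, 40, 0), (120, 30, 3)], [1, 9], [3, 7], 5)
```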

The filtered index ranges were verified manually by ensuring that
the filtered ranges would always evaluate the min/max conjuncts to
true.

Change-Id: Idace1e57067f95973cef3567eeb84f2ad87fd3f6
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
A be/src/exec/parquet-index-filter.cc
A be/src/exec/parquet-index-filter.h
M bin/impala-config.sh
M common/thrift/parquet.thrift
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_index.py
12 files changed, 589 insertions(+), 69 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/65/7465/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7465
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Idace1e57067f95973cef3567eeb84f2ad87fd3f6
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <pooja.nilangekar@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
