impala-reviews mailing list archives

From "Pooja Nilangekar (Code Review)" <>
Subject [Impala-ASF-CR] Populate OffsetIndex and ColumnIndex of a row group and Filter pages
Date Wed, 26 Jul 2017 18:38:01 GMT
Pooja Nilangekar has uploaded a new patch set (#3).

Change subject: Populate OffsetIndex and ColumnIndex of a row_group and Filter pages

Populate OffsetIndex and ColumnIndex of a row_group and Filter pages

The statistics for each page in a ColumnChunk of a RowGroup are
added to the ColumnIndex structure. When a page is flushed to the
file, its location and the offset of its first row are added to the
PageLocation structure of the OffsetIndex. If a file is found to
have only one row_group when it is finalized, the ColumnIndex for each
column is written to the file (just before the footer) and its length
and offset are populated in the ColumnChunk. The OffsetIndexes of all
the columns in the row_group are written to the RowGroupOffsetIndex
structure and written out to the file. The offset and length of the
index are written out to the RowGroup. This ensures that range
scans and point lookups can skip pages based on these statistics
while at the same time scans without selective predicates do not
incur any overhead. Space efficiency is ensured by not populating
parquet::Statistics in the ColumnMeta when the statistics are written
to the ColumnIndex. Additionally, for ordered columns, the
ColumnIndex only contains the min_values.
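
As a rough illustration of the write path described above, the sketch below
records per-page min/max values and page locations as pages are flushed.
The struct layouts, the ColumnIndexBuilder class, and the use of int64_t
values are simplified assumptions made for this example; they are not the
Thrift definitions or the writer code from the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-ins for the structures named above.
struct PageLocation {
  int64_t offset;           // byte offset of the page in the file
  int32_t compressed_size;  // size of the page on disk
  int64_t first_row_index;  // index of the page's first row in the row group
};

struct OffsetIndex {
  std::vector<PageLocation> page_locations;
};

struct ColumnIndex {
  std::vector<int64_t> min_values;
  std::vector<int64_t> max_values;  // stays empty for ordered columns
};

// Hypothetical helper: accumulates index entries for one column chunk.
class ColumnIndexBuilder {
 public:
  explicit ColumnIndexBuilder(bool ordered) : ordered_(ordered) {}

  // Called once per flushed page with the page's location and its values.
  void AddPage(int64_t file_offset, int32_t compressed_size,
               const std::vector<int64_t>& values) {
    assert(!values.empty());
    int64_t min = values[0];
    int64_t max = values[0];
    for (int64_t v : values) {
      if (v < min) min = v;
      if (v > max) max = v;
    }
    column_index_.min_values.push_back(min);
    // For an ordered column only the minimum of each page is kept.
    if (!ordered_) column_index_.max_values.push_back(max);
    offset_index_.page_locations.push_back(
        {file_offset, compressed_size, rows_written_});
    rows_written_ += static_cast<int64_t>(values.size());
  }

  const ColumnIndex& column_index() const { return column_index_; }
  const OffsetIndex& offset_index() const { return offset_index_; }

 private:
  bool ordered_;
  int64_t rows_written_ = 0;
  ColumnIndex column_index_;
  OffsetIndex offset_index_;
};
```

In this sketch the builder would be consulted once when the file is
finalized, at which point both structures could be serialized ahead of
the footer.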

While scanning a RowGroup, the HdfsParquetScanner invokes the
ParquetIndexFilter for the RowGroups where the indexes are present.
The filter evaluates each conjunct against each page of the
corresponding column. It consolidates the RowRanges for the given
RowGroup and returns the final set of pages to be scanned for each
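
The page filtering step can be pictured with the minimal sketch below. It
assumes each conjunct is a closed interval [lo, hi] on an int64_t column
and that every page carries both a min and a max; PageStats, RowRange,
FilterPages and IntersectRanges are hypothetical names chosen for this
example, not the ParquetIndexFilter interface from the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-page entry combining ColumnIndex and OffsetIndex data.
struct PageStats {
  int64_t min_value;
  int64_t max_value;
  int64_t first_row_index;
};

// Inclusive range of row indexes within a row group.
struct RowRange {
  int64_t first;
  int64_t last;
};

// Returns the row ranges of pages whose [min, max] may overlap [lo, hi].
// Ranges of adjacent surviving pages are merged.
std::vector<RowRange> FilterPages(const std::vector<PageStats>& pages,
                                  int64_t num_rows, int64_t lo, int64_t hi) {
  assert(lo <= hi);
  std::vector<RowRange> ranges;
  for (std::size_t i = 0; i < pages.size(); ++i) {
    // The page cannot contain a match if its value range misses [lo, hi].
    if (pages[i].max_value < lo || pages[i].min_value > hi) continue;
    int64_t first = pages[i].first_row_index;
    int64_t next_first =
        (i + 1 < pages.size()) ? pages[i + 1].first_row_index : num_rows;
    int64_t last = next_first - 1;
    if (!ranges.empty() && ranges.back().last + 1 == first) {
      ranges.back().last = last;
    } else {
      ranges.push_back({first, last});
    }
  }
  return ranges;
}

// Intersects two sorted, non-overlapping range lists, as needed when
// conjuncts on different columns must all hold for a row.
std::vector<RowRange> IntersectRanges(const std::vector<RowRange>& a,
                                      const std::vector<RowRange>& b) {
  std::vector<RowRange> out;
  std::size_t i = 0;
  std::size_t j = 0;
  while (i < a.size() && j < b.size()) {
    int64_t first = std::max(a[i].first, b[j].first);
    int64_t last = std::min(a[i].last, b[j].last);
    if (first <= last) out.push_back({first, last});
    // Advance whichever range ends first.
    if (a[i].last < b[j].last) {
      ++i;
    } else {
      ++j;
    }
  }
  return out;
}
```

Under these assumptions, a page of any column needs to be read only if
its rows overlap one of the intersected ranges.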

Testing: The populated index structures were deserialized from the
parquet file and the validity of the offsets and statistics were

The filtered index ranges were verified manually by ensuring that
the filtered ranges would always evaluate the min/max conjuncts to

Change-Id: Idace1e57067f95973cef3567eeb84f2ad87fd3f6
M be/src/exec/CMakeLists.txt
M be/src/exec/
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/
M be/src/exec/
M be/src/exec/parquet-column-stats.h
A be/src/exec/
A be/src/exec/parquet-index-filter.h
M bin/
M common/thrift/parquet.thrift
M tests/query_test/
M tests/util/
12 files changed, 573 insertions(+), 54 deletions(-)

  git pull ssh:// refs/changes/65/7465/3

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Idace1e57067f95973cef3567eeb84f2ad87fd3f6
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <>
Gerrit-Reviewer: Lars Volker <>
Gerrit-Reviewer: Pooja Nilangekar <>
