impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lars Volker (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-4815, IMPALA-4817, IMPALA-4819: Write and Read Parquet Statistics for remaining types
Date Mon, 08 May 2017 22:12:28 GMT
Hello Marcel Kornacker,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6563

to look at the new patch set (#11).

Change subject: IMPALA-4815, IMPALA-4817, IMPALA-4819: Write and Read Parquet Statistics for
remaining types
......................................................................

IMPALA-4815, IMPALA-4817, IMPALA-4819: Write and Read Parquet Statistics for remaining types

This change adds functionality to write and read parquet::Statistics for
Decimal, String, and Timestamp values. As an exception, we don't read
statistics for CHAR columns, since CHAR support is broken in Impala
(IMPALA-1652).

This change also switches from using the deprecated fields 'min' and
'max' to populate the new fields 'min_value' and 'max_value' in
parquet::Statistics, that were added in parquet-format pull request #46.

The HdfsParquetScanner will preferably read the new fields if they are
populated and if the column order 'TypeDefinedOrder' has been used to
compute the statistics. For columns without a column order set or with
only the deprecated fields populated, the scanner will read them only if
they are of simple numeric type, i.e. boolean, integer, or floating
point.

This change removes the validation of the Parquet Statistics we write to
Hive from the tests, since Hive does not write the new fields. Instead
it adds a parquet file written by Hive that uses the deprecated fields
for its statistics. It uses that file to exercise the fallback logic for
supported types in a test.

This change also cleans up the interface of ParquetPlainEncoder in
parquet-common.h.

Change-Id: I3ef4a5d25a57c82577fd498d6d1c4297ecf39312
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exec/parquet-plain-test.cc
M be/src/util/dict-encoding.h
M common/thrift/parquet.thrift
M testdata/data/README
A testdata/data/deprecated_statistics.parquet
A testdata/workloads/functional-query/queries/QueryTest/parquet-deprecated-stats.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test
M testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_parquet_stats.py
19 files changed, 975 insertions(+), 349 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/6563/11
-- 
To view, visit http://gerrit.cloudera.org:8080/6563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3ef4a5d25a57c82577fd498d6d1c4297ecf39312
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Attila Jeges <attilaj@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>

Mime
View raw message