impala-reviews mailing list archives

From "Quanlong Huang (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node
Date Fri, 29 Sep 2017 08:27:24 GMT
Hello Tim Armstrong, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8147

to look at the new patch set (#2).

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet scan node
......................................................................

IMPALA-5448: fix invalid number of splits reported in Parquet scan node

Parquet splits with multiple columns are marked as completed through
HdfsScanNodeBase::RangeComplete(), which counts the file type once per
column codec type. As a result, the reported number of Parquet splits is
the real count multiplied by the number of materialized columns.

Furthermore, the Parquet format allows different columns to use different
compression codecs. This patch handles that case as well. A Parquet file
using both the gzip and snappy compression codecs will be reported as:
	FileFormats: PARQUET/(GZIP,SNAPPY):1

This patch introduces a set of compression types to handle the above cases.
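
The counting idea can be illustrated with a minimal sketch. It is not the
actual change in hdfs-scan-node-base.cc: the SplitCounter class and its method
names are hypothetical, and the rendering of a single codec (printed without
parentheses) and of an empty codec list (printed as NONE) are assumptions.
Only the PARQUET/(GZIP,SNAPPY):1 form comes from this message. The sketch
counts each split once, keyed by the file format together with the set of
codecs seen in that split, instead of once per materialized column.

```python
from collections import Counter


class SplitCounter:
    """Hypothetical stand-in for the scan node's per-file-type counters."""

    def __init__(self):
        # Key: (file format, frozenset of codec names); value: completed splits.
        self._counts = Counter()

    def range_complete(self, file_format, codecs):
        # Called once per split, no matter how many columns it materializes.
        # Duplicate codec names collapse because the key holds a set.
        # Assumption: a split with no codec information is recorded as NONE.
        codec_set = frozenset(codecs) or frozenset({"NONE"})
        self._counts[(file_format, codec_set)] += 1

    def summary(self):
        # Sort for a stable, readable ordering of the entries.
        items = sorted(
            self._counts.items(),
            key=lambda kv: (kv[0][0], sorted(kv[0][1])),
        )
        parts = []
        for (file_format, codec_set), n in items:
            names = sorted(codec_set)
            # Assumption: one codec is printed bare, several are parenthesized.
            if len(names) == 1:
                label = names[0]
            else:
                label = "(" + ",".join(names) + ")"
            parts.append(f"{file_format}/{label}:{n}")
        return " ".join(parts)
```

With this sketch, one split whose columns use GZIP, SNAPPY and GZIP again
yields the summary PARQUET/(GZIP,SNAPPY):1 rather than three separate counts.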

Testing:
Added end-to-end tests covering Parquet files with all columns compressed
with snappy and Parquet files with multiple compression codecs.

Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/multi_compression_parquet_data/README
A testdata/multi_compression_parquet_data/tinytable_0_gzip_snappy.parq
A testdata/multi_compression_parquet_data/tinytable_1_snappy_gzip.parq
A testdata/workloads/functional-query/queries/QueryTest/hdfs_parquet_scan_node_profile.test
M tests/query_test/test_scanners.py
9 files changed, 111 insertions(+), 13 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/8147/2
-- 
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <huangquanlong@gmail.com>
Gerrit-Reviewer: Quanlong Huang <huangquanlong@gmail.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
