impala-reviews mailing list archives

From "Alex Behm (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-4725: Query option to control Parquet array resolution.
Date Wed, 08 Mar 2017 19:29:26 GMT
Hello Impala Public Jenkins, Dan Hecht,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6250

to look at the new patch set (#4).

Change subject: IMPALA-4725: Query option to control Parquet array resolution.
......................................................................

IMPALA-4725: Query option to control Parquet array resolution.

Summary of changes:
Introduces a new query option PARQUET_ARRAY_RESOLUTION to
control the path-resolution behavior for Parquet files
with nested arrays. The values are:
- THREE_LEVEL
  Assumes arrays are encoded with the 3-level representation.
  Also resolves arrays encoded with a single level.
  Does not attempt a 2-level resolution.
- TWO_LEVEL
  Assumes arrays are encoded with the 2-level representation.
  Also resolves arrays encoded with a single level.
  Does not attempt a 3-level resolution.
- TWO_LEVEL_THEN_THREE_LEVEL
  First tries to resolve assuming the 2-level representation,
  and if unsuccessful, tries the 3-level representation.
  Also resolves arrays encoded with a single level.
  This is the current Impala behavior and is used as the
  default value for compatibility.
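
The order of attempts implied by the three values above can be sketched as follows. This is only an illustration, not the code in parquet-metadata-utils.cc: the names ATTEMPTS and resolve_with_policy are hypothetical, the caller supplies the function that performs a single 2-level or 3-level attempt, and the single-level case is left out.

```python
from typing import Callable, Dict, List, Optional, TypeVar

T = TypeVar("T")

# Hypothetical table: which encodings each policy value tries, in order.
ATTEMPTS: Dict[str, List[int]] = {
    "THREE_LEVEL": [3],
    "TWO_LEVEL": [2],
    "TWO_LEVEL_THEN_THREE_LEVEL": [2, 3],
}


def resolve_with_policy(
    policy: str, try_levels: Callable[[int], Optional[T]]
) -> Optional[T]:
    """Return the first successful attempt, or None if every attempt fails.

    try_levels(n) is assumed to return None when an n-level resolution
    does not match the file schema.
    """
    for levels in ATTEMPTS[policy]:
        result = try_levels(levels)
        if result is not None:
            return result
    return None
```

Under this sketch the default value stops at the first attempt that succeeds, so a 2-level attempt that happens to match is never re-checked against the 3-level interpretation.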

Note that 'failure' to resolve a schema path with a given
array-resolution policy does not necessarily mean a warning or
error is returned by the query. A mismatch might instead be
treated like a missing field, a behavior that is necessary to
support schema evolution. There is no way to reliably distinguish
the 'bad resolution' and 'legitimately missing field' cases.

The new query option is independent of and can be combined
with the existing PARQUET_FALLBACK_SCHEMA_RESOLUTION.
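
As a small illustration of the two options being set together, the sketch below builds the SET statements for every combination. The helper set_statements is hypothetical, and POSITION and NAME are assumed here to be the values of the existing option that correspond to the index-based and name-based resolution discussed in this message.

```python
from itertools import product
from typing import List

# Values of the new option, as listed in this change description.
ARRAY_RESOLUTION = ("THREE_LEVEL", "TWO_LEVEL", "TWO_LEVEL_THEN_THREE_LEVEL")
# Assumed values of the existing option (index-based vs. name-based).
FALLBACK_SCHEMA_RESOLUTION = ("POSITION", "NAME")


def set_statements(array_resolution: str, fallback: str) -> List[str]:
    """Build the two SET statements for one combination of option values."""
    if array_resolution not in ARRAY_RESOLUTION:
        raise ValueError(f"unknown array resolution: {array_resolution}")
    if fallback not in FALLBACK_SCHEMA_RESOLUTION:
        raise ValueError(f"unknown fallback resolution: {fallback}")
    return [
        f"SET PARQUET_ARRAY_RESOLUTION={array_resolution};",
        f"SET PARQUET_FALLBACK_SCHEMA_RESOLUTION={fallback};",
    ]


# Every pairing is allowed because the options are independent.
ALL_COMBINATIONS = [
    set_statements(a, f)
    for a, f in product(ARRAY_RESOLUTION, FALLBACK_SCHEMA_RESOLUTION)
]
```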

Background:
Arrays can be represented in several ways in Parquet:
- Three Level Encoding (standard)
- Two Level Encoding (legacy)
- One Level Encoding (legacy)
More details are in the "Lists" section of the spec:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

Unfortunately, there is no reliable metadata within Parquet files
to indicate which encoding was used. There is even the possibility
of having mixed encodings within the same file if there are multiple
arrays.

As a result, Impala currently applies the TWO_LEVEL_THEN_THREE_LEVEL
policy when resolving a schema path in a Parquet file, in an attempt
to auto-detect the file encoding.

However, regardless of whether a Parquet data file uses the 2-level
or 3-level encoding, the index-based resolution may return incorrect
results if the representation in the Parquet file does not
exactly match the attempted array-resolution policy. Intuitively,
when attempting a 2-level resolution on a 3-level file, the matched
schema node may not be deep enough in the schema tree, but could still
be a scalar node with the expected type. Similarly, when attempting a
3-level resolution on a 2-level file, a level may be incorrectly
skipped.
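
The mismatch described above can be reproduced on a toy schema tree. The sketch below is a simplified model, not Impala's resolver: the Node class, the resolve function, and the field names (s2, f21, f22, f11, f12) are all hypothetical, type checking is reduced to "scalar or not", and the single-level encoding is ignored. The table is assumed to be ARRAY<STRUCT<s2: STRUCT<f21, f22>, f11, f12>>, with paths given as child positions below the array item.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    repeated: bool = False

    @property
    def is_scalar(self) -> bool:
        return not self.children


def struct_fields() -> List[Node]:
    # Hypothetical item fields: s2 {f21, f22}, f11, f12.
    return [Node("s2", [Node("f21"), Node("f22")]), Node("f11"), Node("f12")]


# 3-level: arr -> repeated group "list" -> "element" -> fields
MODERN = Node("arr", [Node("list", [Node("element", struct_fields())], repeated=True)])
# 2-level: arr -> repeated group "array" -> fields
LEGACY = Node("arr", [Node("array", struct_fields(), repeated=True)])


def resolve(array_node: Node, field_path: List[int], levels: int) -> Optional[Node]:
    """Index-based resolution below the array item; None means 'missing'."""
    if not array_node.children or not array_node.children[0].repeated:
        return None
    node = array_node.children[0]  # 2-level: the repeated node is the item
    if levels == 3:
        if not node.children:
            return None
        node = node.children[0]  # 3-level: skip the wrapper group
    for idx in field_path:
        if idx >= len(node.children):
            return None
        node = node.children[idx]
    return node if node.is_scalar else None


S2_F22 = [0, 1]  # table path item.s2.f22
F11 = [1]        # table path item.f11

print(resolve(MODERN, S2_F22, 3).name)  # matching policy
print(resolve(MODERN, S2_F22, 2).name)  # too shallow, still lands on a scalar
print(resolve(LEGACY, F11, 2).name)     # matching policy
print(resolve(LEGACY, F11, 3).name)     # one level incorrectly skipped
print(resolve(MODERN, F11, 2))          # looks like a missing field
```

In this model the 2-level attempt on the 3-level file resolves item.s2.f22 to f11, and the 3-level attempt on the 2-level file resolves item.f11 to f22. Both wrong nodes are scalars, so nothing signals an error, and the last call shows a mismatch that is indistinguishable from a legitimately missing field.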

The name-based policy generally does not have this problem because it
avoids traversing incorrect schema paths. However, the index-based
resolution allows a different set of schema-evolution operations,
so just using name-based resolution is not an acceptable workaround
in all cases.

Testing:
- Added new Parquet data files that show how incorrect results
  can be returned with a mismatched file encoding and resolution
  policy. Added both 2-level and 3-level versions of the data.
- Added a new test in test_nested_types.py that shows the behavior
  with the new PARQUET_ARRAY_RESOLUTION query option.
- Locally ran test_scanners.py and test_nested_types.py on core.

Change-Id: I4f32e19ec542d4d485154c9d65d0f5e3f9f0a907
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
A testdata/parquet_nested_types_encodings/AmbiguousList.avsc
A testdata/parquet_nested_types_encodings/AmbiguousList.json
A testdata/parquet_nested_types_encodings/AmbiguousList_Legacy.parquet
A testdata/parquet_nested_types_encodings/AmbiguousList_Modern.parquet
A testdata/parquet_nested_types_encodings/README
A testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-legacy.test
A testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-modern.test
M tests/query_test/test_nested_types.py
15 files changed, 412 insertions(+), 40 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/6250/4
-- 
To view, visit http://gerrit.cloudera.org:8080/6250
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4f32e19ec542d4d485154c9d65d0f5e3f9f0a907
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhecht@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
