impala-reviews mailing list archives

From "Alex Behm (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-4725: Query option to control Parquet array resolution.
Date Fri, 03 Mar 2017 20:20:37 GMT
Alex Behm has uploaded a new change for review.

Change subject: IMPALA-4725: Query option to control Parquet array resolution.

IMPALA-4725: Query option to control Parquet array resolution.

Summary of changes:
Introduces a new query option PARQUET_ARRAY_RESOLUTION to
control the path-resolution behavior for Parquet files
with nested arrays. The three values behave as follows:
- Assumes arrays are encoded with the 3-level representation.
  Also resolves arrays encoded with a single level.
  Does not attempt a 2-level resolution.
- Assumes arrays are encoded with the 2-level representation.
  Also resolves arrays encoded with a single level.
  Does not attempt a 3-level resolution.
- First tries to resolve assuming the 2-level representation,
  and if unsuccessful, tries the 3-level representation.
  Also resolves arrays encoded with a single level.
  This is the current Impala behavior and is used as the
  default value for compatibility.

Note that 'failure' to resolve a schema path with a given
array-resolution policy does not necessarily mean a warning or
error is returned by the query. A mismatch might be treated
like a missing field, a case that must be tolerated to support
schema evolution. There is no way to reliably distinguish the
'bad resolution' and 'legitimately missing field' cases.

The new query option is independent of and can be combined

Arrays can be represented in several ways in Parquet:
- Three Level Encoding (standard)
- Two Level Encoding (legacy)
- One Level Encoding (legacy)
More details are in the "Lists" section of the Parquet format spec.
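
As a rough illustration of the three encodings listed above, the sketch
below models each one for an array of int32 as a tiny schema tree and
counts the levels from the list field down to the leaf. The Node class,
the constant names and depth_to_leaf() are hypothetical helpers made up
for this example and are not Impala or Parquet APIs; the schema snippets
in the comments follow the shapes described in the Parquet spec, with
illustrative field names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a Parquet schema node."""
    name: str
    repetition: str  # "required", "optional" or "repeated"
    type: str = "group"  # "group" or a scalar type such as "int32"
    children: List["Node"] = field(default_factory=list)


# 3-level (standard):
#   optional group my_list (LIST) {
#     repeated group list { optional int32 element; }
#   }
THREE_LEVEL = Node("my_list", "optional", children=[
    Node("list", "repeated", children=[
        Node("element", "optional", "int32"),
    ]),
])

# 2-level (legacy):
#   optional group my_list (LIST) { repeated int32 element; }
TWO_LEVEL = Node("my_list", "optional", children=[
    Node("element", "repeated", "int32"),
])

# 1-level (legacy):
#   repeated int32 my_list;
ONE_LEVEL = Node("my_list", "repeated", "int32")


def depth_to_leaf(node: Node) -> int:
    """Count schema levels from the list field down to its first leaf."""
    depth = 1
    while node.children:
        node = node.children[0]
        depth += 1
    return depth


for label, schema in [("3-level", THREE_LEVEL),
                      ("2-level", TWO_LEVEL),
                      ("1-level", ONE_LEVEL)]:
    print(label, depth_to_leaf(schema))
```

The same logical column therefore sits at a different depth in the file
schema depending on the writer, which is why the reader has to assume
or guess an encoding.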

Unfortunately, there is no reliable metadata within Parquet files
to indicate which encoding was used. There is even the possibility
of having mixed encodings within the same file if there are multiple

As a result, Impala currently tries to auto-detect the file encoding
when resolving a schema path in a Parquet file: it first tries the
2-level representation and, if that fails, the 3-level one.

However, regardless of whether a Parquet data file uses the 2-level
or 3-level encoding, the index-based resolution may return incorrect
results if the representation in the Parquet file does not
exactly match the attempted array-resolution policy. Intuitively,
when attempting a 2-level resolution on a 3-level file, the matched
schema node may not be deep enough in the schema tree, but could still
be a scalar node with expected type. Similarly, when attempting a
3-level resolution on a 2-level file a level may be incorrectly skipped.

The name-based policy generally does not have this problem because it
avoids traversing incorrect schema paths. However, the index-based
resolution allows a different set of schema-evolution operations,
so just using name-based resolution is not an acceptable workaround
in all cases.
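
The mismatch described above can be reproduced with a toy model. The
sketch below is not Impala's resolver; Node, resolve() and
resolve_two_then_three() are hypothetical names, and the element type
struct<s: struct<x: int, y: int>, p: int> is an assumption chosen so
that a walk that is one level too shallow or too deep still lands on an
int32 leaf. Paths are lists of child positions, as in index-based
resolution.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a Parquet schema node."""
    name: str
    type: str = "group"  # "group" or a scalar type such as "int32"
    children: List["Node"] = field(default_factory=list)


def element_fields() -> List[Node]:
    # Assumed element type: struct<s: struct<x: int, y: int>, p: int>
    return [
        Node("s", children=[Node("x", "int32"), Node("y", "int32")]),
        Node("p", "int32"),
    ]


# 2-level (legacy): the repeated group is itself the element.
TWO_LEVEL_FILE = Node("a", children=[
    Node("array", children=element_fields()),
])

# 3-level (standard): a repeated wrapper group holds one element child.
THREE_LEVEL_FILE = Node("a", children=[
    Node("list", children=[Node("element", children=element_fields())]),
])


def resolve(array_node: Node, field_idxs: List[int],
            levels: int) -> Optional[Node]:
    """Walk to the array element, then follow child positions.

    Returns None when a position is out of range, which is
    indistinguishable from a legitimately missing field.
    """
    node = array_node.children[0]  # the repeated node
    if levels == 3:
        if not node.children:
            return None
        node = node.children[0]  # skip the wrapper to reach the element
    for idx in field_idxs:
        if idx >= len(node.children):
            return None
        node = node.children[idx]
    return node


def resolve_two_then_three(array_node: Node,
                           field_idxs: List[int]) -> Optional[Node]:
    """Try the 2-level walk first, then fall back to the 3-level walk."""
    found = resolve(array_node, field_idxs, levels=2)
    if found is not None:
        return found
    return resolve(array_node, field_idxs, levels=3)


# a.item.p is position path [1]; a.item.s.y is position path [0, 1].
print(resolve(THREE_LEVEL_FILE, [1], 3).name)     # p (correct)
print(resolve(TWO_LEVEL_FILE, [1], 3).name)       # y (wrong int32 leaf)
print(resolve(THREE_LEVEL_FILE, [0, 1], 2).name)  # p (wrong int32 leaf)
print(resolve(THREE_LEVEL_FILE, [1], 2))          # None (looks missing)
print(resolve_two_then_three(THREE_LEVEL_FILE, [0, 1]).name)  # p (wrong)
```

In this model the 2-level-then-3-level fallback does not help for the
path [0, 1] on the 3-level file, because the first attempt "succeeds"
on the wrong node and the second attempt never runs.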

- Added new Parquet data files that show how incorrect results
  can be returned with a mismatched file encoding and resolution
  policy. Added both 2-level and 3-level versions of the data.
- Added a new test that shows the behavior
  with the new PARQUET_ARRAY_RESOLUTION query option.
- Locally ran and on core.

Change-Id: I4f32e19ec542d4d485154c9d65d0f5e3f9f0a907
M be/src/exec/
M be/src/exec/
M be/src/exec/parquet-metadata-utils.h
M be/src/service/
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M testdata/data/schemas/nested/modern_nested.parquet
M testdata/data/schemas/nested/nested.avsc
M testdata/data/schemas/nested/nested.json
A testdata/parquet_nested_types_encodings/AmbiguousList.avsc
A testdata/parquet_nested_types_encodings/AmbiguousList.json
A testdata/parquet_nested_types_encodings/AmbiguousList_Legacy.parquet
A testdata/parquet_nested_types_encodings/AmbiguousList_Modern.parquet
A testdata/parquet_nested_types_encodings/README
A testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-legacy.test
A testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-modern.test
M tests/query_test/
18 files changed, 455 insertions(+), 33 deletions(-)

  git pull ssh:// refs/changes/50/6250/1

Gerrit-MessageType: newchange
Gerrit-Change-Id: I4f32e19ec542d4d485154c9d65d0f5e3f9f0a907
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <>
