Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 84590200C3E for ; Tue, 7 Mar 2017 07:00:02 +0100 (CET) Received: by cust-asf.ponee.io (Postfix) id 82D33160B81; Tue, 7 Mar 2017 06:00:02 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id A57D9160B76 for ; Tue, 7 Mar 2017 07:00:01 +0100 (CET) Received: (qmail 54194 invoked by uid 500); 7 Mar 2017 06:00:01 -0000 Mailing-List: contact reviews-help@impala.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list reviews@impala.incubator.apache.org Received: (qmail 54144 invoked by uid 99); 7 Mar 2017 06:00:00 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 07 Mar 2017 06:00:00 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 1C52CC05AA for ; Tue, 7 Mar 2017 06:00:00 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 0.363 X-Spam-Level: X-Spam-Status: No, score=0.363 tagged_above=-999 required=6.31 tests=[RDNS_DYNAMIC=0.363, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id NRdaZukb_R76 for ; Tue, 7 Mar 2017 05:59:57 +0000 (UTC) Received: from ip-10-146-233-104.ec2.internal (ec2-75-101-130-251.compute-1.amazonaws.com [75.101.130.251]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTPS id 746195F36E for ; Tue, 7 Mar 2017 05:59:57 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by ip-10-146-233-104.ec2.internal (8.14.4/8.14.4) with ESMTP id v275xvop009596; Tue, 7 Mar 2017 05:59:57 GMT Message-Id: <201703070559.v275xvop009596@ip-10-146-233-104.ec2.internal> Date: Tue, 7 Mar 2017 05:59:56 +0000 From: "Alex Behm (Code Review)" To: Dan Hecht , impala-cr@cloudera.com, reviews@impala.incubator.apache.org Reply-To: alex.behm@cloudera.com X-Gerrit-MessageType: newpatchset Subject: =?UTF-8?Q?=5BImpala-ASF-CR=5D_IMPALA-4725=3A_Query_option_to_control_Parquet_array_resolution=2E=0A?= X-Gerrit-Change-Id: I4f32e19ec542d4d485154c9d65d0f5e3f9f0a907 X-Gerrit-ChangeURL: X-Gerrit-Commit: 51bbe9796dfaaf62321925705cfe2808b400af05 In-Reply-To: References: MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Content-Disposition: inline User-Agent: Gerrit/2.12.7 archived-at: Tue, 07 Mar 2017 06:00:02 -0000 Hello Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6250 to look at the new patch set (#2). Change subject: IMPALA-4725: Query option to control Parquet array resolution. ...................................................................... IMPALA-4725: Query option to control Parquet array resolution. Summary of changes: Introduces a new query option PARQUET_ARRAY_RESOLUTION to control the path-resolution behavior for Parquet files with nested arrays. The values are: - THREE_LEVEL Assumes arrays are encoded with the 3-level representation. Also resolves arrays encoded with a single level. Does not attempt a 2-level resolution. - TWO_LEVEL Assumes arrays are encoded with the 2-level representation. Also resolves arrays encoded with a single level. Does not attempt a 3-level resolution. - TWO_LEVEL_THEN_THREE_LEVEL First tries to resolve assuming the 2-level representation, and if unsuccessful, tries the 3-level representation. Also resolves arrays encoded with a single level. This is the current Impala behavior and is used as the default value for compatibility. Note that 'failure' to resolve a schema path with a given array-resolution policy does not necessarily mean a warning or error is returned by the query. A mismatch might be treated like a missing field which is necessary to support schema evolution. There is no way to reliably distinguish the 'bad resolution' and 'legitimately missing field' cases. The new query option is independent of and can be combined with the existing PARQUET_FALLBACK_SCHEMA_RESOLUTION. Background: Arrays can be represented in several ways in Parquet: - Three Level Encoding (standard) - Two Level Encoding (legacy) - One Level Encoding (legacy) More details are in the "Lists" section of the spec: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md Unfortunately, there is no reliable metadata within Parquet files to indicate which encoding was used. There is even the possibility of having mixed encodings within the same file if there are multiple arrays. As a result, Impala currently tries to auto-detect the file encoding when resolving a schema path in a Parquet file using the TWO_LEVEL_THEN_THREE_LEVEL policy. However, regardless of whether a Parquet data file uses the 2-level or 3-level encoding, the index-based resolution may return incorrect results if the representation in the Parquet file does not exactly match the attempted array-resoution policy. Intuitively, when attempting a 2-level resolution on a 3-level file, the matched schema node may not be deep enough in the schema tree, but could still be a scalar node with expected type. Similarly, when attempting a 3-level resolution on a 2-level file a level may be incorrectly skipped. The name-based policy generally does not have this problem because it avoids traversing incorrect schema paths. However, the index-based resoution allows a different set of schema-evolution operations, so just using name-based resolution is not an acceptable workaround in all cases. Testing: - Added new Parquet data files that show how incorrect results can be returned with a mismatched file encoding and resolution policy. Added both 2-level and 3-level versions of the data. - Added a new test in test_nested_types.py that shows the behavior with the new PARQUET_ARRAY_RESOLUTION query option. - Locally ran test_scanners.py and test_nested_types.py on core. Change-Id: I4f32e19ec542d4d485154c9d65d0f5e3f9f0a907 --- M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/parquet-metadata-utils.cc M be/src/exec/parquet-metadata-utils.h M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M testdata/data/schemas/nested/modern_nested.parquet M testdata/data/schemas/nested/nested.avsc M testdata/data/schemas/nested/nested.json A testdata/parquet_nested_types_encodings/AmbiguousList.avsc A testdata/parquet_nested_types_encodings/AmbiguousList.json A testdata/parquet_nested_types_encodings/AmbiguousList_Legacy.parquet A testdata/parquet_nested_types_encodings/AmbiguousList_Modern.parquet A testdata/parquet_nested_types_encodings/README A testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-legacy.test A testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-modern.test M tests/query_test/test_nested_types.py 18 files changed, 433 insertions(+), 42 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/6250/2 -- To view, visit http://gerrit.cloudera.org:8080/6250 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I4f32e19ec542d4d485154c9d65d0f5e3f9f0a907 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Alex Behm Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht