Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 43F4B200C39 for ; Thu, 2 Mar 2017 00:06:20 +0100 (CET) Received: by cust-asf.ponee.io (Postfix) id 42972160B78; Wed, 1 Mar 2017 23:06:20 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 64426160B70 for ; Thu, 2 Mar 2017 00:06:19 +0100 (CET) Received: (qmail 79168 invoked by uid 500); 1 Mar 2017 23:06:18 -0000 Mailing-List: contact reviews-help@impala.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list reviews@impala.incubator.apache.org Received: (qmail 79156 invoked by uid 99); 1 Mar 2017 23:06:18 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 01 Mar 2017 23:06:18 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id DC361C05AA for ; Wed, 1 Mar 2017 23:06:17 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 0.362 X-Spam-Level: X-Spam-Status: No, score=0.362 tagged_above=-999 required=6.31 tests=[RDNS_DYNAMIC=0.363, SPF_PASS=-0.001] autolearn=disabled Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id OYXPEBC_1ORk for ; Wed, 1 Mar 2017 23:06:16 +0000 (UTC) Received: from ip-10-146-233-104.ec2.internal (ec2-75-101-130-251.compute-1.amazonaws.com [75.101.130.251]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTPS id A81DE5F1EE for ; Wed, 1 Mar 2017 23:06:15 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by ip-10-146-233-104.ec2.internal (8.14.4/8.14.4) with ESMTP id v21N6Ejr000366; Wed, 1 Mar 2017 23:06:14 GMT Message-Id: <201703012306.v21N6Ejr000366@ip-10-146-233-104.ec2.internal> Date: Wed, 1 Mar 2017 23:06:13 +0000 From: "Joe McDonnell (Code Review)" To: Marcel Kornacker , Lars Volker , impala-cr@cloudera.com, reviews@impala.incubator.apache.org CC: Matthew Mulder , Mostafa Mokhtar , Alex Behm , Tim Armstrong Reply-To: joemcdonnell@cloudera.com X-Gerrit-MessageType: newpatchset Subject: =?UTF-8?Q?=5BImpala-ASF-CR=5D_IMPALA-4624=3A_Implement_Parquet_dictionary_filtering=0A?= X-Gerrit-Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7 X-Gerrit-ChangeURL: X-Gerrit-Commit: 56973b44224323aa7bf3efe66a39e03296d2ff2d In-Reply-To: References: MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Content-Disposition: inline User-Agent: Gerrit/2.12.7 archived-at: Wed, 01 Mar 2017 23:06:20 -0000 Hello Marcel Kornacker, Impala Public Jenkins, Lars Volker, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/5904 to look at the new patch set (#17). Change subject: IMPALA-4624: Implement Parquet dictionary filtering ...................................................................... IMPALA-4624: Implement Parquet dictionary filtering Here is a basic summary of the changes: Frontend looks for conjuncts that operate on a single slot and pass a map from slot id to the conjunct index through thrift to the backend. The conjunct indices are the indices into the normal PlanNode conjuncts list. The conjuncts need to satisfy certain conditions: 1. They are bound on a single slot 2. They are deterministic (no random functions) 3. They evaluate to FALSE on a NULL input. This is because the dictionary does not include NULLs, so any condition that evaluates to TRUE on NULL cannot be evaluated by looking only at the dictionary. The backend converts the indices into ExprContexts. These are cloned in the scanner threads. The dictionary read codepath has been removed from ReadDataPage into its own function, InitDictionary. This has also been turned into its own step in row group initialization. ReadDataPage will not see any dictionary pages unless the parquet file is invalid. For dictionary filtering, we initialize dictionaries only as needed to evaluate the conjuncts. The Parquet scanner evaluates the dictionary filter conjuncts on the dictionary to see if any dictionary entry passes. If no entry passes, the row group is eliminated. If the row group passes the dictionary filtering, then we initialize all remaining dictionaries. Dictionary filtering is controlled by a new query option, parquet_dictionary_filtering, which is on by default. Since column chunks can have a mixture of encodings, dictionary filtering uses three tests to determine whether this is purely dictionary encoded: 1. If the encoding_stats is in the parquet file, then use it to determine if there are only dictionary encoded pages (i.e. there are no data pages with an encoding other than PLAIN_DICTIONARY). -OR- 2. If the encoding stats are not present, then look at the encodings. The column is purely dictionary encoded if: a) PLAIN_DICTIONARY is present AND b) Only PLAIN_DICTIONARY, RLE, or BIT_PACKED encodings are listed -OR- 3. If this file was written by an older version of Impala, then we know that dictionary failover happens when the dictionary reaches 40,000 values. Dictionary filtering can proceed as long as the dictionary is smaller than that. parquet-mr writes the encoding list correctly in the current version in our environment (1.5.0). This means that check #2 works on some existing files (potentially most existing parquet-mr files). parquet-mr writes the encoding stats starting in 1.9.0. This is the version where check #1 will start working. Impala's parquet writer now implements both, so either check above will work. Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7 --- M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-scanner.h M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/util/dict-encoding.h M be/src/util/dict-test.cc M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/PlanNodes.thrift M common/thrift/parquet.thrift M fe/src/main/java/org/apache/impala/analysis/Expr.java M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test M testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test A testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test A testdata/workloads/functional-query/queries/QueryTest/mt-dop-parquet-filtering.test A testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test M tests/query_test/test_scanners.py 27 files changed, 1,444 insertions(+), 192 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/04/5904/17 -- To view, visit http://gerrit.cloudera.org:8080/5904 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7 Gerrit-PatchSet: 17 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Mulder Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Tim Armstrong