Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 1EB8D200C24 for ; Thu, 23 Feb 2017 17:03:59 +0100 (CET) Received: by cust-asf.ponee.io (Postfix) id 1D54F160B62; Thu, 23 Feb 2017 16:03:59 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 36AEB160B50 for ; Thu, 23 Feb 2017 17:03:58 +0100 (CET) Received: (qmail 58333 invoked by uid 500); 23 Feb 2017 16:03:57 -0000 Mailing-List: contact reviews-help@impala.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list reviews@impala.incubator.apache.org Received: (qmail 58321 invoked by uid 99); 23 Feb 2017 16:03:56 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 23 Feb 2017 16:03:56 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id 87475182765 for ; Thu, 23 Feb 2017 16:03:56 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 0.362 X-Spam-Level: X-Spam-Status: No, score=0.362 tagged_above=-999 required=6.31 tests=[RDNS_DYNAMIC=0.363, SPF_PASS=-0.001] autolearn=disabled Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id 2YMGT33lXJAt for ; Thu, 23 Feb 2017 16:03:53 +0000 (UTC) Received: from ip-10-146-233-104.ec2.internal (ec2-75-101-130-251.compute-1.amazonaws.com [75.101.130.251]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTPS id A11F75FAD8 for ; Thu, 23 Feb 2017 16:03:52 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by ip-10-146-233-104.ec2.internal (8.14.4/8.14.4) with ESMTP id v1NG3pYD006772; Thu, 23 Feb 2017 16:03:51 GMT Message-Id: <201702231603.v1NG3pYD006772@ip-10-146-233-104.ec2.internal> Date: Thu, 23 Feb 2017 16:03:50 +0000 From: "Marcel Kornacker (Code Review)" To: Joe McDonnell , impala-cr@cloudera.com, reviews@impala.incubator.apache.org CC: Alex Behm , Lars Volker , Mostafa Mokhtar , Matthew Mulder Reply-To: marcel@cloudera.com X-Gerrit-MessageType: comment Subject: =?UTF-8?Q?=5BImpala-ASF-CR=5D_IMPALA-4624=3A_Implement_Parquet_dictionary_filtering=0A?= X-Gerrit-Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7 X-Gerrit-ChangeURL: X-Gerrit-Commit: 7289898ac854472daf78dc0f38a751116ea1f1d8 In-Reply-To: References: MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Content-Disposition: inline User-Agent: Gerrit/2.12.7 archived-at: Thu, 23 Feb 2017 16:03:59 -0000 Marcel Kornacker has posted comments on this change. Change subject: IMPALA-4624: Implement Parquet dictionary filtering ...................................................................... Patch Set 9: (17 comments) http://gerrit.cloudera.org:8080/#/c/5904/9/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 71: // is guaranteed to be true for Impala versions of 2.9 or below. "versions 2.9" Line 599: // TODO: The values should be converted at dictionary read and store "during dictionary construction and stored" Line 600: // in converted form in the dictionary. let's tackle this particular todo immediately after the current patch goes in, having dictionary filtering for string columns is going to be very useful. Line 699: if (file_version_.application == "impala" && file_version_.VersionLt(2,10,0)) { this will need to change to 2.9 later, because 2.9 will support encoding_stats Line 706: BaseScalarColumnReader* scalar_reader = why not make this list contain BaseScalarColumnReader*? Line 715: // This reader is not 100% dictionary encoded, so dictionary filters i think it's more accurate to say "we cannot say with certainty whether the columns if 100% ..." http://gerrit.cloudera.org:8080/#/c/5904/9/be/src/exec/hdfs-parquet-scanner.h File be/src/exec/hdfs-parquet-scanner.h: Line 443: std::vector dict_filter_tuple_backing_; you can use ScopedBuffer here. take a look at https://gerrit.cloudera.org/#/c/6032/11/be/src/exec/hdfs-parquet-scanner.h Line 620: /// structures needed for evaluating dictionary filtering. point out which member vars get populated (= side effects) in other words, you could have also phrased this as "populates dict_filterable_readers_ and non_dict_filterable_readers_". http://gerrit.cloudera.org:8080/#/c/5904/9/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 761: Status BaseScalarColumnReader::ReadPageHeader(bool peek, indentation Line 840: // is not a dictionary, then there is no dictionary. how does this interact with the dictionary_page_offset? do we ignore that? Line 849: if (UNLIKELY(data_size < 0)) { these checks aren't really related to dictionaries, can't we do that elsewhere? fine to leave todo. http://gerrit.cloudera.org:8080/#/c/5904/9/be/src/exec/parquet-column-readers.h File be/src/exec/parquet-column-readers.h: Line 366: // Initializes the dictionary, if it exists point out specific state that changes http://gerrit.cloudera.org:8080/#/c/5904/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: Line 698: return output.toString(); > Added a comment. I'm sorting to display the predicates in the same order as but isn't the unsorted order the one in which they're being applied? http://gerrit.cloudera.org:8080/#/c/5904/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: Line 332: * dictionaryFilterConjuncts_ map. you can leave it as-is, i don't mean to belabor the point, but putting this more concisely would also be fine: "Walks through conjuncts and populates dictionaryFilterConjuncts_." (you already described what that variable means elsewhere) Line 682: if (detailLevel.ordinal() >= TExplainLevel.VERBOSE.ordinal()) { i think the min/max predicates are output at extended level, might as well follow that. Line 693: for (Integer idx : totalIdxList) { exprList.add(conjuncts_.get(idx)); } remove {} http://gerrit.cloudera.org:8080/#/c/5904/9/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test File testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test: Line 23: dictionary filter predicates: int_col > 1 lars called his "parquet statistics predicates', so let's call these 'parquet dictionary predicates' -- To view, visit http://gerrit.cloudera.org:8080/5904 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7 Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Mulder Gerrit-Reviewer: Mostafa Mokhtar Gerrit-HasComments: Yes