From: "Zach Amsden (Code Review)"
To: impala-cr@cloudera.com, reviews@impala.incubator.apache.org
CC: Michael Ho, Joe McDonnell, Tim Armstrong
Reply-To: zamsden@cloudera.com
Date: Fri, 2 Jun 2017 01:15:48 +0000
Subject: [Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Message-Id: <201706020115.v521Fnxs015534@ip-10-146-233-104.ec2.internal>
X-Gerrit-MessageType: newpatchset
X-Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054
X-Gerrit-Commit: d7bc67ec25cce91b156169dadcaa5862f810332b

Zach Amsden has uploaded a new patch set (#14).

Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries
......................................................................

IMPALA-4864 Speed up single slot predicates with dictionaries

When dictionaries are present we can pre-evaluate conjuncts against the
dictionary values and simply look up the result.

Status: Runs with ASAN, runs without crashes on ee tests. Performance
results inconclusive; this may not be worth the complexity unless we get
really selective or really expensive predicates.

Basic idea: we codegen early, before we know enough about the columns to
tell whether they are dict filterable. So if we do have dictionary
filtering predicates, we codegen a guard around each dictionary
filterable predicate evaluation. This guard skips evaluation of the
predicate if it has already been evaluated by the dictionary.
In this way, we can skip evaluation dynamically for each row group as we
learn which columns are dictionary filterable, and then push predicate
evaluation into the column reader. Since the branches will remain 100%
predictable over the row group, this should give us the fastest way to
skip over predicate evaluation without compromising the general case
where we may be unable to evaluate against the dictionary.

We can even do this with codegen turned off, as a side effect of the way
we generate the codegen'd function when dictionary evaluation is enabled.

If dictionaries aren't available for some predicates, we automatically
fall back to evaluating those predicates in the original order,
preserving selectivity. The overhead in this case is a perfectly
predictable extra conditional per dictionary predicate.

Performance: Hard to get! Simple predicates did not show improvement,
and in fact two of them regressed. I used a TPC-H scale 30 dataset,
duplicated 3x into a 'biglineitem' table.

select count(*) from biglineitem WHERE l_returnflag = 'A';
1.43s -> 1.53s

select count(*) from biglineitem WHERE l_shipinstruct = 'DELIVER IN PERSON';
1.43s -> 1.53s

select count(*) from biglineitem WHERE l_quantity > 49;
0.93s -> 0.93s

select count(*) from biglineitem WHERE instr(l_shipdate, '1994-11') > 0;
2.33s -> 1.03s

So this appears to only be a win for expensive predicates.

Update: I added changes to make computed predicate costs visible from
the frontend to the backend, and tried a TPC-DS scale 10 dataset, which
has better queries (lots of IN groups). There is still a regression in
raw query performance.
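
For readers unfamiliar with the technique, the dictionary pre-evaluation
and the per-row-group guard described above can be sketched roughly as
follows. This is an illustrative Python model only, not the code in this
patch (which lives in Impala's C++ backend and codegen path). The names
RowGroup, count_matches and already_evaluated are hypothetical, and the
sketch assumes a single string column and a single predicate.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

Predicate = Callable[[str], bool]


@dataclass
class RowGroup:
    """Hypothetical single-column row group.

    If `dictionary` is set, the column is dictionary encoded and `codes`
    holds one dictionary index per row. Otherwise `plain` holds the raw
    values and no dictionary filtering is possible.
    """
    dictionary: Optional[List[str]] = None
    codes: List[int] = field(default_factory=list)
    plain: List[str] = field(default_factory=list)


def count_matches(row_groups: List[RowGroup], predicate: Predicate) -> int:
    """Count rows satisfying `predicate`, using the dictionary when present."""
    total = 0
    for rg in row_groups:
        if rg.dictionary is not None:
            # Pre-evaluate the predicate once per distinct dictionary value.
            lookup = [predicate(v) for v in rg.dictionary]
            # "Column reader" side: drop rows by table lookup on the code.
            rows = [rg.dictionary[c] for c in rg.codes if lookup[c]]
            already_evaluated = True
        else:
            # Fallback: no dictionary, so every row reaches the predicate.
            rows = rg.plain
            already_evaluated = False

        for value in rows:
            # Guard: constant for the whole row group, so the branch is
            # perfectly predictable. The predicate is skipped when the
            # dictionary has already answered it.
            if already_evaluated or predicate(value):
                total += 1
    return total
```

In this model a dictionary-encoded row group costs one predicate call per
distinct value plus one table lookup per row, which is consistent with
the observation above that only expensive predicates benefit: a cheap
comparison costs about as much as the lookup it replaces.
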
Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/exec-node.cc
M be/src/exec/exec-node.h
M be/src/exec/hdfs-parquet-scanner-ir.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/exec/parquet-scratch-tuple-batch.h
M be/src/runtime/descriptors.h
M be/src/runtime/row-batch.h
M be/src/runtime/tuple.h
M be/src/util/bitmap-test.cc
M be/src/util/bitmap.h
M be/src/util/dict-encoding.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/PlanNode.java
20 files changed, 527 insertions(+), 142 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/6726/14
--
To view, visit http://gerrit.cloudera.org:8080/6726
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054
Gerrit-PatchSet: 14
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden
Gerrit-Reviewer: Joe McDonnell
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Tim Armstrong
Gerrit-Reviewer: Zach Amsden