From: Marcel Kornacker
Date: Mon, 27 Mar 2017 11:49:56 -0700
Subject: Re: Min/Max runtime filtering on Impala-Kudu
To: dev@impala.incubator.apache.org

On Mon, Mar 27, 2017 at 11:42 AM, Sailesh Mukil wrote:
> I will be working on a patch to add min/max filter support in Impala, and
> as a first step, specifically target the KuduScanNode, since the Kudu
> client is already able to accept a Min and a Max that it would internally
> use to filter during its scans. Below is a brief design proposal.
>
> *Goal:*
>
> To leverage runtime min/max filter support in Kudu for the potential
> speedup of queries over Kudu tables. Kudu does this by taking a min and a
> max that Impala will provide and returning only values in the range Impala
> is interested in.
>
> *[min <= range we're interested in <= max]*
>
> *Proposal:*
>
>
>    - As a first step, plumb the runtime filter code from
>    *exec/hdfs-scan-node-base.cc/h* to *exec/scan-node.cc/h*, so that it
>    can be applied to *KuduScanNode* cleanly as well, since *KuduScanNode*
>    and *HdfsScanNodeBase* both inherit from *ScanNode.*

Quick comment: please make sure your solution also applies to KuduScanNodeMt.

>    - Reuse the *ColumnStats* class (exec/parquet-column-stats.h) or
>    implement a lighter-weight version of it to process and store the Min and
>    the Max on the build side of the join.
>    - Once the Min and Max values are added to the existing runtime filter
>    structures, as a first step, we will ignore the Min and Max values for
>    non-Kudu tables. Using them for non-Kudu tables can come in as a
>    follow-up patch(es).
>    - Similarly, the bloom filter will be ignored for Kudu tables, and only
>    the Min and Max values will be used, since Kudu does not accept bloom
>    filters yet. (https://issues.apache.org/jira/browse/IMPALA-3741)
>    - Applying the bloom filter on the Impala side of the Kudu scan (i.e. in
>    KuduScanNode) is not in the scope of this patch.
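
For readers of the archive, the min/max idea in the quoted proposal can be sketched roughly as follows. This is only an illustration in Python, not Impala's C++ code: `MinMaxFilter`, `build_filter` and `scan_with_filter` are hypothetical names, and the "scan" is a plain list filter standing in for the Kudu scan that would apply the range.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class MinMaxFilter:
    """Hypothetical stand-in for the min/max part of a runtime filter."""
    min_val: Optional[int] = None
    max_val: Optional[int] = None

    def insert(self, value: int) -> None:
        # Build side of the join: widen the range to cover each join key.
        if self.min_val is None or value < self.min_val:
            self.min_val = value
        if self.max_val is None or value > self.max_val:
            self.max_val = value

    def may_match(self, value: int) -> bool:
        # An empty build side means no scanned row can match.
        if self.min_val is None or self.max_val is None:
            return False
        return self.min_val <= value <= self.max_val


def build_filter(build_keys: Iterable[int]) -> MinMaxFilter:
    f = MinMaxFilter()
    for k in build_keys:
        f.insert(k)
    return f


def scan_with_filter(rows: Iterable[int], f: MinMaxFilter) -> List[int]:
    # Stand-in for the scan: only rows inside [min, max] are returned.
    return [r for r in rows if f.may_match(r)]


build_keys = [17, 42, 23]
probe_rows = [5, 17, 30, 42, 99]
f = build_filter(build_keys)
surviving = scan_with_filter(probe_rows, f)
print(f.min_val, f.max_val, surviving)
```

Note that 30 survives the scan even though it is not a build-side key: a min/max range is conservative, so the join itself still has to discard such rows.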
>
>
> *Complications:*
>
>    - We have to make sure that finding the Min and Max values on the build
>    side doesn't regress certain workloads, since the difference between
>    generating a bloom filter and generating a Min and a Max is that a bloom
>    filter can be type-agnostic (we just take a raw hash over the data)
>    whereas a Min and a Max have to be type-specific.
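
The type-agnostic versus type-specific distinction above can be illustrated with a small Python sketch. It rests on assumptions: `zlib.crc32` merely stands in for whatever hash a bloom filter would use, and the helper names are hypothetical. The point is that a hash can run over raw bytes regardless of column type, while a min/max computed over the same raw bytes gives the wrong answer unless the comparison knows the type.

```python
import struct
import zlib


def raw_hash(raw: bytes) -> int:
    # Type-agnostic: the hash only sees bytes, whatever the column type is.
    return zlib.crc32(raw)


def typed_min_max(values):
    # Type-specific: relies on the values' own ordering (ints here).
    return min(values), max(values)


def bytewise_min_max(encoded):
    # Deliberately naive: orders the raw encodings lexicographically.
    return min(encoded), max(encoded)


values = [1, 256]
encoded = [struct.pack("<i", v) for v in values]  # little-endian int32

hashes = [raw_hash(e) for e in encoded]
typed = typed_min_max(values)
naive = tuple(struct.unpack("<i", e)[0] for e in bytewise_min_max(encoded))
print(typed, naive)
```

Here the typed comparison yields (1, 256), while the byte-wise comparison over the little-endian encodings swaps the two values, which is why the min/max path needs per-type handling that the hashing path does not.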