From: Gmail
Subject: Re: [Propose] More data skipping technology for IO intensive performance enhancement
Message-Id: <1483D995-6784-49EC-8F55-013297CDBDA1@gmail.com>
Date: Wed, 6 Jul 2016 19:23:20 +0800
To: dev@hawq.incubator.apache.org

BTW, could you create some related issues in JIRA?

Thanks
xunzhang

Sent from my iPhone

> On Jul 2, 2016, at 23:19, Ming Li wrote:
>
> Data skipping technology can avoid a great deal of unnecessary IO, so it can
> greatly improve performance for IO-intensive queries. In addition to
> eliminating scans of unnecessary table partitions according to the partition
> key range, I think more options are available now:
>
> (1) The Parquet / ORC formats introduce lightweight metadata such as
> Min/Max/Bloom filter for each block. Such metadata can be exploited when
> the predicate/filter info can be fetched before executing the scan.
>
> However, in HAWQ today all data in a Parquet table needs to be scanned into
> memory before the predicate/filter is processed. We don't generate the
> metadata on INSERT into a Parquet table, and the scan executor doesn't use
> the metadata either. Maybe some scan APIs need to be refactored so that we
> can get the predicate/filter info before executing the base relation scan.
>
> (2) Based on the technology in (1), especially Bloom filters, more optimizer
> techniques can be explored further. E.g.
> Impala implemented runtime filtering
> (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_runtime_filtering.html),
> which can be used for
> - dynamic partition pruning
> - converting a join predicate to a base relation predicate
>
> It tells the executor to wait for a moment (the interval can be set in a
> GUC) before executing the base relation scan. If the values of interest
> arrive in time (e.g. the column in the join predicate has only a very small
> set of values), the executor can use these values to filter the scan; if they
> don't arrive in time, it scans without the filter, which doesn't affect
> result correctness.
>
> Unlike the technology in (1), this one cannot be used in every case; it only
> outperforms in some cases. So it just adds some more query plan
> choices/paths, and the optimizer needs to calculate the cost based on
> statistics and apply it only when it lowers the cost.
>
> All in all, maybe more similar technologies can be adopted for HAWQ now.
> Let's start to think about performance-related technology; moreover, we need
> to investigate how these technologies can be implemented in HAWQ.
>
> Any ideas or suggestions are welcome. Thanks.
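
To make the two ideas in the quoted proposal concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: `Block`, `write_blocks`, `scan`, and `join` are hypothetical names, not HAWQ, Parquet, ORC, or Impala APIs, and a plain Python set stands in for a Bloom filter. It shows per-block min/max metadata written at insert time, a scan that skips blocks whose range cannot match the predicate, and an optional runtime filter (the join-side values, if they arrived in time) that narrows the scan without changing the join result.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class Block:
    """One block of an integer column plus min/max metadata (hypothetical layout)."""
    values: List[int]
    min_value: int
    max_value: int


def write_blocks(values: List[int], block_size: int) -> List[Block]:
    """INSERT side: split values into blocks and record min/max for each block."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan(blocks: List[Block], lo: int, hi: int,
         runtime_filter: Optional[Set[int]] = None) -> Tuple[List[int], int]:
    """Scan for lo <= v <= hi; return (matching rows, number of blocks read).

    runtime_filter is the set of join keys if it arrived in time, else None.
    """
    if runtime_filter is not None:
        if not runtime_filter:
            return [], 0  # empty join side: no row can survive the join
        # Narrow the predicate range using the join keys.
        lo = max(lo, min(runtime_filter))
        hi = min(hi, max(runtime_filter))
    if lo > hi:
        return [], 0
    rows, blocks_read = [], 0
    for block in blocks:
        # Data skipping: the block's [min, max] range cannot overlap [lo, hi].
        if block.max_value < lo or block.min_value > hi:
            continue
        blocks_read += 1
        for v in block.values:
            if lo <= v <= hi and (runtime_filter is None or v in runtime_filter):
                rows.append(v)
    return rows, blocks_read


def join(rows: List[int], keys: Set[int]) -> List[int]:
    """Stand-in for the join: keep rows whose value appears on the other side."""
    return [v for v in rows if v in keys]


if __name__ == "__main__":
    blocks = write_blocks(list(range(100)), block_size=10)
    print(scan(blocks, 25, 34))

    keys = {42, 47}
    with_filter, read_with = scan(blocks, 0, 99, runtime_filter=keys)
    without_filter, read_without = scan(blocks, 0, 99)
    print(read_with, read_without)
    print(join(with_filter, keys) == join(without_filter, keys))
```

In this sketch the filtered and unfiltered scans feed the same join and produce the same result; the filter only reduces how many blocks are read. That mirrors the point in the proposal that scanning without the filter is slower but still correct. A real implementation would also need the wait interval and the cost-based decision described above, which are left out here.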