Return-Path: X-Original-To: apmail-drill-dev-archive@www.apache.org Delivered-To: apmail-drill-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id E3D041874C for ; Thu, 10 Sep 2015 17:40:19 +0000 (UTC) Received: (qmail 578 invoked by uid 500); 10 Sep 2015 17:40:19 -0000 Delivered-To: apmail-drill-dev-archive@drill.apache.org Received: (qmail 523 invoked by uid 500); 10 Sep 2015 17:40:19 -0000 Mailing-List: contact dev-help@drill.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@drill.apache.org Delivered-To: mailing list dev@drill.apache.org Received: (qmail 511 invoked by uid 99); 10 Sep 2015 17:40:19 -0000 Received: from Unknown (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 10 Sep 2015 17:40:19 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id E126918096B for ; Thu, 10 Sep 2015 17:40:18 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 2.88 X-Spam-Level: ** X-Spam-Status: No, score=2.88 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=3, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled Authentication-Results: spamd3-us-west.apache.org (amavisd-new); dkim=pass (1024-bit key) header.d=maprtech.com Received: from mx1-eu-west.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id kfP7p5WU_e7s for ; Thu, 10 Sep 2015 17:40:10 +0000 (UTC) Received: from mail-yk0-f178.google.com (mail-yk0-f178.google.com [209.85.160.178]) by mx1-eu-west.apache.org (ASF Mail Server at mx1-eu-west.apache.org) with ESMTPS id AF78E20CCD for ; Thu, 10 Sep 2015 17:40:09 +0000 (UTC) Received: by ykei199 with SMTP id i199so65644938yke.0 for ; Thu, 10 Sep 2015 10:40:08 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=maprtech.com; s=google; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=Jxd1qqrtfuvwlYRcd+v2FH+1g1IZeZEgxKKHO8WHMqM=; b=noNL/EWhal0WP0lDflpTz4ZwbhVuXNr180g+bQIXr9JYNqrLYOWU/gZSx56KwbyNWJ OACLl6RwI9gOL4CL3J3S5sn2le1q4tH1be7UdisGlA64r6DmBSuuWqGtliieyPamiwZM MqWhXOToAkizr/ukgGn43bFDcP4t+BlIFNUF0= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to:content-type; bh=Jxd1qqrtfuvwlYRcd+v2FH+1g1IZeZEgxKKHO8WHMqM=; b=gTif5J/L8nfrsM6gGI1p7idvFA69lfMOrT3yeKgjr+yw8Oyu73gqaw8xtw1P6pPxR1 cAw4O0hm5/z2JukyLWHkUqafFN3/om2Wc3jUSY3O9U7L8raQsSNAjuQrPl0M0BrUttou 39aUdY75ABwltdIeGVLnzEn63KqkZf8qtrjoDvwXwT8h7BvJd+4PqLdwZjZ7F349zPh/ W+sO7M6vpk6lPNuSpYAAlLxYiUo3IeH6wBRbIsf+O+FqlC6U1ZqkXgtqEG4pgw6PX/V9 ziFPKPApbiB239aNpDjrgvWP5qsuu9J6sLduPFTupHsY7fgu68PweT5nihCW02q2g6Iu QjBA== X-Gm-Message-State: ALoCoQl23iiwjKYTd+Xun30mr+7m0lV1/U+6qW7QuvdJh3xFDnvGYmdwmfLRtpzQu/D5qRErdKo9 MIME-Version: 1.0 X-Received: by 10.13.253.69 with SMTP id n66mr49737426ywf.0.1441906808215; Thu, 10 Sep 2015 10:40:08 -0700 (PDT) Received: by 10.37.216.23 with HTTP; Thu, 10 Sep 2015 10:40:08 -0700 (PDT) In-Reply-To: References: Date: Thu, 10 Sep 2015 10:40:08 -0700 Message-ID: Subject: Re: In list filter evaluation : room for improvement in run-time code generation. From: Hsuan Yi Chu To: dev@drill.apache.org Content-Type: multipart/alternative; boundary=94eb2c06a5721a99e4051f681764 --94eb2c06a5721a99e4051f681764 Content-Type: text/plain; charset=UTF-8 I believe the usage of Semi-Join had been proposed before. Would that new operator help in this scenario you think? On Wed, Sep 9, 2015 at 8:16 PM, Jinfeng Ni wrote: > The reason that the in-list join approach is not fast enough : > the query has 5 in-lists ORed together. Each in-list is converted > to a left outer join. After the 5 left outer join, there is a filter. > > Since left outer join does not prune any row from left side, > which is the base table in this case, essentially each join has > to scan the same # of rows as the base table, and copy > to the outgoing batch. That is, although the in-list evaluation > is using hash-based probe, which is faster than the original > filter evaluation, still 5 left out join incurs big overhead > in scanning/copying the data. > > The UDF idea in #2 is essentially doing the same kind of hash-based > probe in filter evaluation. The hash-table will be initialized as > a workspace variable in the doSetup(). Then, the doEval() will > simply probe the hash-table. I feel it would achieve the same > benefit of join approach, while avoid the overhead of re-scanning > the data multiple times. > > However, the current infrastructure seems miss the support > of VarArg in Drill's build-in or UDF, which is required to implement > this idea. > > > > On Wed, Sep 9, 2015 at 5:40 PM, Aman Sinha wrote: > > > Yes, this would be a good enhancement. Any improvement to the > > efficiency/compactness of the generated code is complimentary to other > > optimizations such as parquet filter pushdown. I recall that there was a > > JIRA a while ago with hundreds or thousands of filter conditions > creating a > > really bloated generated code - we should revisit that at some point to > > identify scope for improvement. > > I am not so sure about the UDF suggestion in #2. It seems like > > identifying why the large IN-list join approach was slow and fixing that > > would be a general solution. > > > > Aman > > > > On Wed, Sep 9, 2015 at 1:31 PM, Jinfeng Ni > wrote: > > > > > Weeks ago there was a message on drill user list, reporting performance > > > issues caused by in list filter [1]. The query has filter: > > > > > > WHERE > > > c0 IN (v_00, v_01, v_02, v_03, ... ) > > > OR > > > c1 IN (v_11, v_11, v_12, v_13, ....) > > > OR > > > c2 IN ... > > > OR > > > c3 IN ... > > > OR > > > .... > > > > > > The profile shows that most of query time is spent on filter > evaluation. > > > One workaround that we recommend was to re-write the query so that the > > > planner would convert in list into join operation. Turns out that > > > converting > > > into join did help improve performance, but not as much as we wanted. > > > > > > The original query has parquet as the data source. Therefore, the ideal > > > solution is parquet filter pushdown, which DRILL-1950 would address. > > > > > > On the other hand, I noticed that there seems to be room for > improvement > > > in the run-time generated code. In particular, for " c0 in (v_00, v_01, > > > ...)", > > > Drill will evaluate it as : > > > c0 = v_00 OR c0 = v_01 OR ... > > > > > > Each reference of "c0" will lead to initialization of vector and holder > > > assignment in the generated code. There is redundant evaluation for > > > the common reference. > > > > > > I put together a patch,which will avoid the redundant evaluation for > the > > > common reference. Using TPCH scale factor 10's lineitem table, I saw > > > quite surprising improvement. (run on Mac with embedded drillbit) > > > > > > 1) In List uses integer type [2] > > > master branch : 12.53 seconds > > > patch on top of master branch : 7.073 seconds > > > That's almost 45% improvement. > > > > > > 2) In List uses binary type [3] > > > master branch : 198.668 seconds > > > patch on top of master branch: 20.37 seconds > > > > > > Two thoughts: > > > 1. Will code size impact Janino compiler optimization or jvm hotspot > > > optimization? Otherwise, it seems hard to explain the performance > > > difference of removing the redundant evaluation. That might imply > > > that the efficiency of run-time generated code may degrade with > > > more expressions in the query (?) > > > > > > 2. For In-List filter, it might make sense to create a Drill UDF. The > > > UDF will build a heap-based hashtable in setup, in a similar way > > > as what the join approach will do. > > > > > > I'm going to open a JIRA to submit the patch for review, as I feel > > > it will benefit not only the in list filter, but also expressions with > > > common column references. > > > > > > > > > [1] > > > > > > > > > https://mail-archives.apache.org/mod_mbox/drill-user/201508.mbox/%3CCAC-7oTym0Yzr2RmXhDPag6k41se-uTkWu0QC%3DMABb7s94DJ0BA%40mail.gmail.com%3E > > > > > > [2] https://gist.github.com/jinfengni/7f6df9ed7d2c761fed33 > > > > > > [3] https://gist.github.com/jinfengni/7460f6d250f0d00009ed > > > > > > --94eb2c06a5721a99e4051f681764--