Date: Mon, 6 Nov 2017 23:28:00 +0000 (UTC)
From: "Boaz Ben-Zvi (JIRA)"
To: issues@drill.apache.org
Reply-To: dev@drill.apache.org
Subject: [jira] [Created] (DRILL-5935) Hash Join projects unneeded columns

Boaz Ben-Zvi created DRILL-5935:
-----------------------------------

             Summary: Hash Join projects unneeded columns
                 Key: DRILL-5935
                 URL: https://issues.apache.org/jira/browse/DRILL-5935
             Project: Apache Drill
          Issue Type: Bug
          Components: Execution - Relational Operators
    Affects Versions: 1.11.0
            Reporter: Boaz Ben-Zvi
            Priority: Minor

The Hash Join operator projects all its input columns, including unneeded ones, relying on (multiple) project operators downstream to remove those columns. This is significantly wasteful, in both time and space (as each value is copied individually). Instead, the Hash Join itself should not project these unneeded columns.

In the following example, the join-key columns need not be projected. However the two hash join operators do project them.

Another problem: The join-key columns are copied from BOTH sides (build and probe), which is a waste, as both are IDENTICAL.

Last - the plan in this example places the first join under the _build_ side of the second join; and the unneeded column from the first join (the join-key) is taken and finally projected by the second join.

The sample query is:

{code}
select c.c_first_name, c.c_last_name, s.ss_quantity, a.ca_city
from dfs.`/data/json/s1/customer` c,
     dfs.`/data/json/s1/store_sales` s,
     dfs.`/data/json/s1/customer_address` a
where c.c_customer_sk = s.ss_customer_sk
  and c.c_customer_id = a.ca_address_id;
{code}

The plan first builds on 'customer_address' and probes with 'customer', and the output projects all 6 columns (2 from 'a', 4 from 'c'). Then the second join builds on all those 6 columns from the first join, and probes from the large table 'store_sales', and finally all 8 columns are projected (see below). Then 3 project operators are used to remove the unneeded columns (see attached profile) - hence more waste.

{code}
public void projectBuildRecord(int buildIndex, int outIndex)
    throws SchemaChangeException
{
    // Build side of the second join: six copies, one per column of the first join's output.
    // buildIndex is used as a composite index: the upper 16 bits select the source
    // vector from the array, the lower 16 bits select the value within it.
    {
        vv3.copyFromSafe(((buildIndex) & 65535), (outIndex), vv0[((buildIndex) >>> 16)]);
    }
    {
        vv9.copyFromSafe(((buildIndex) & 65535), (outIndex), vv6[((buildIndex) >>> 16)]);
    }
    {
        vv15.copyFromSafe(((buildIndex) & 65535), (outIndex), vv12[((buildIndex) >>> 16)]);
    }
    {
        vv21.copyFromSafe(((buildIndex) & 65535), (outIndex), vv18[((buildIndex) >>> 16)]);
    }
    {
        vv27.copyFromSafe(((buildIndex) & 65535), (outIndex), vv24[((buildIndex) >>> 16)]);
    }
    {
        vv33.copyFromSafe(((buildIndex) & 65535), (outIndex), vv30[((buildIndex) >>> 16)]);
    }
}

public void projectProbeRecord(int probeIndex, int outIndex)
    throws SchemaChangeException
{
    // Probe side ('store_sales'): two copies, one per probe column.
    {
        vv39.copyFromSafe((probeIndex), (outIndex), vv36);
    }
    {
        vv45.copyFromSafe((probeIndex), (outIndex), vv42);
    }
}
{code}
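
To make the expected saving concrete, here is a minimal sketch of the column pruning described above, applied to the two joins of the sample query. It is not Drill planner or operator code: the class `JoinProjectionSketch` and the method `requiredOutput` are hypothetical names used only for illustration, and columns are modeled as plain strings. The only rule it encodes is that a join should emit an input column only if something downstream (the final select list, or the key of a later join) still reads it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical illustration only; not part of Apache Drill.
final class JoinProjectionSketch {

    /**
     * Returns the input columns (build side first, then probe side) that a
     * join would still have to emit, given the columns read downstream.
     */
    static List<String> requiredOutput(List<String> buildCols,
                                       List<String> probeCols,
                                       Collection<String> neededDownstream) {
        List<String> out = new ArrayList<>();
        for (String col : buildCols) {
            if (neededDownstream.contains(col)) {
                out.add(col);
            }
        }
        for (String col : probeCols) {
            if (neededDownstream.contains(col)) {
                out.add(col);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Final select list of the sample query.
        List<String> selectList =
            Arrays.asList("c_first_name", "c_last_name", "ss_quantity", "ca_city");

        // First join: builds on 'customer_address', probes with 'customer'.
        // Downstream still needs the select-list columns plus c_customer_sk,
        // which is the key of the second join.
        List<String> firstJoin = requiredOutput(
            Arrays.asList("ca_city", "ca_address_id"),
            Arrays.asList("c_first_name", "c_last_name", "c_customer_sk", "c_customer_id"),
            Arrays.asList("c_first_name", "c_last_name", "ca_city", "c_customer_sk"));

        // Second join: builds on the first join's output, probes with 'store_sales'.
        List<String> secondJoin = requiredOutput(
            firstJoin,
            Arrays.asList("ss_quantity", "ss_customer_sk"),
            selectList);

        System.out.println("first join emits " + firstJoin.size() + " columns: " + firstJoin);
        System.out.println("second join emits " + secondJoin.size() + " columns: " + secondJoin);
    }
}
```

Under these assumptions the first join would emit 4 columns instead of 6 and the second join 4 instead of 8, so the join-key columns are never copied beyond the operator that last needs them.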