Mailing-List: contact issues-help@drill.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@drill.apache.org
Date: Mon, 6 Nov 2017 23:28:00 +0000 (UTC)
From: "Boaz Ben-Zvi (JIRA)" <jira@apache.org>
To: issues@drill.apache.org
Message-ID: <JIRA.13116539.1510010822000.164177.1510010880057@Atlassian.JIRA>
In-Reply-To: <JIRA.13116539.1510010822000@Atlassian.JIRA>
References: <JIRA.13116539.1510010822000@Atlassian.JIRA> <JIRA.13116539.1510010822719@jira-lw-us.apache.org>
Subject: [jira] [Created] (DRILL-5935) Hash Join projects unneeded columns
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Mon, 06 Nov 2017 23:28:05 -0000

Boaz Ben-Zvi created DRILL-5935:
-----------------------------------

             Summary: Hash Join projects unneeded columns
                 Key: DRILL-5935
                 URL: https://issues.apache.org/jira/browse/DRILL-5935
             Project: Apache Drill
          Issue Type: Bug
          Components: Execution - Relational Operators
    Affects Versions: 1.11.0
            Reporter: Boaz Ben-Zvi
            Priority: Minor


 The Hash Join operator projects all its input columns, including unneeded ones, relying on (multiple) project operators downstream to remove those columns. This is significantly wasteful, in both time and space (as each value is copied individually). Instead, the Hash Join itself should not project these unneeded columns. 

   In the following example, the join-key columns need not be projected. However the two hash join operators do project them.  Another problem: The join-key columns are copied from BOTH sides (build and probe), which is a waste, as both are IDENTICAL.
   Last - the plan in this example places the first join under the _build_ side of the second join; and the unneeded column from the first join (the join-key) is taken and finally projected by the second join. 

The sample query is:
{code}
select c.c_first_name, c.c_last_name, s.ss_quantity, a.ca_city 
from dfs.`/data/json/s1/customer` c, dfs.`/data/json/s1/store_sales` s, dfs.`/data/json/s1/customer_address` a
where c.c_customer_sk = s.ss_customer_sk and c.c_customer_id = a.ca_address_id; 
{code}

The plan first builds on 'customer_address' and probes with 'customer', and the output projects all 6 columns (2 from 'a', 4 from 'c'). Then the second join builds on all those 6 columns from the first join, and probes from the large table 'store_sales', and finally all 8 columns are projected (see below). Then 3 project operators are used to remove the unneeded columns (see attached profile) - hence more waste.

{code}
    public void projectBuildRecord(int buildIndex, int outIndex)
        throws SchemaChangeException
    {
        {
            vv3 .copyFromSafe(((buildIndex)& 65535), (outIndex), vv0 [((buildIndex)>>> 16)]);
        }
        {
            vv9 .copyFromSafe(((buildIndex)& 65535), (outIndex), vv6 [((buildIndex)>>> 16)]);
        }
        {
            vv15 .copyFromSafe(((buildIndex)& 65535), (outIndex), vv12 [((buildIndex)>>> 16)]);
        }
        {
            vv21 .copyFromSafe(((buildIndex)& 65535), (outIndex), vv18 [((buildIndex)>>> 16)]);
        }
        {
            vv27 .copyFromSafe(((buildIndex)& 65535), (outIndex), vv24 [((buildIndex)>>> 16)]);
        }
        {
            vv33 .copyFromSafe(((buildIndex)& 65535), (outIndex), vv30 [((buildIndex)>>> 16)]);
        }
    }

    public void projectProbeRecord(int probeIndex, int outIndex)
        throws SchemaChangeException
    {
        {
            vv39 .copyFromSafe((probeIndex), (outIndex), vv36);
        }
        {
            vv45 .copyFromSafe((probeIndex), (outIndex), vv42);
        }
    }
{code}
 

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)