drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Boaz Ben-Zvi (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-5935) Hash Join projects unneeded columns
Date Mon, 06 Nov 2017 23:28:00 GMT
Boaz Ben-Zvi created DRILL-5935:
-----------------------------------

             Summary: Hash Join projects unneeded columns
                 Key: DRILL-5935
                 URL: https://issues.apache.org/jira/browse/DRILL-5935
             Project: Apache Drill
          Issue Type: Bug
          Components: Execution - Relational Operators
    Affects Versions: 1.11.0
            Reporter: Boaz Ben-Zvi
            Priority: Minor


 The Hash Join operator projects all its input columns, including unneeded ones, relying on
(multiple) project operators downstream to remove those columns. This is significantly wasteful,
in both time and space (as each value is copied individually). Instead, the Hash Join itself
should not project these unneeded columns. 

   In the following example, the join-key columns need not be projected. However the two hash
join operators do project them.  Another problem: The join-key columns are copied from BOTH
sides (build and probe), which is a waste, as both are IDENTICAL.
   Last - the plan in this example places the first join under the _build_ side of the second
join; and the unneeded column from the first join (the join-key) is taken and finally projected
by the second join. 

The sample query is:
{code}
select c.c_first_name, c.c_last_name, s.ss_quantity, a.ca_city 
from dfs.`/data/json/s1/customer` c, dfs.`/data/json/s1/store_sales` s, dfs.`/data/json/s1/customer_address`
a
where c.c_customer_sk = s.ss_customer_sk and c.c_customer_id = a.ca_address_id; 
{code}

The plan first builds on 'customer_address' and probes with 'customer', and the output projects
all 6 columns (2 from 'a', 4 from 'c'). Then the second join builds on all those 6 columns
from the first join, and probes from the large table 'store_sales', and finally all 8 columns
are projected (see below). Then 3 project operators are used to remove the unneeded columns
(see attached profile) - hence more waste.

{code}
    public void projectBuildRecord(int buildIndex, int outIndex)
        throws SchemaChangeException
    {
        {
            vv3 .copyFromSafe(((buildIndex)& 65535), (outIndex), vv0 [((buildIndex)>>>
16)]);
        }
        {
            vv9 .copyFromSafe(((buildIndex)& 65535), (outIndex), vv6 [((buildIndex)>>>
16)]);
        }
        {
            vv15 .copyFromSafe(((buildIndex)& 65535), (outIndex), vv12 [((buildIndex)>>>
16)]);
        }
        {
            vv21 .copyFromSafe(((buildIndex)& 65535), (outIndex), vv18 [((buildIndex)>>>
16)]);
        }
        {
            vv27 .copyFromSafe(((buildIndex)& 65535), (outIndex), vv24 [((buildIndex)>>>
16)]);
        }
        {
            vv33 .copyFromSafe(((buildIndex)& 65535), (outIndex), vv30 [((buildIndex)>>>
16)]);
        }
    }

    public void projectProbeRecord(int probeIndex, int outIndex)
        throws SchemaChangeException
    {
        {
            vv39 .copyFromSafe((probeIndex), (outIndex), vv36);
        }
        {
            vv45 .copyFromSafe((probeIndex), (outIndex), vv42);
        }
    }
{code}
 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message