hawq-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lin Wen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HAWQ-1597) Implement Runtime Filter for Hash Join
Date Mon, 04 Jun 2018 09:27:00 GMT

    [ https://issues.apache.org/jira/browse/HAWQ-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16499944#comment-16499944

Lin Wen commented on HAWQ-1597:

Bloom filter for Local Hash Join(do not cross slice on outer table) has been finished. By
default, this GUC is disable. If want to use this feature, we should set hawq_hashjoin_bloomfilter
to true.

I have run TPCH benchmark testing on a HAWQ cluster(1 master node, 1 standby node, 3 segment
nodes) with 100G data. The result is below:

My summarization is:
 1. There are several queries probably can benefit from runtime filter since multiple table
joins, they are Q4, Q8, Q12, Q17, Q18. However only Q8 is improved(from 19 seconds to 13
 2. Hash Join in Q4, Q12, Q18 has cross slice on outer table, the Bloomfilter can't be pushed
down across slices currently, so these queries are not improved. It is expected. 
 3. Q17 has two types of SQL, the origin one is:
 sum(l_extendedprice) / 7.0 as avg_yearly
 p_partkey = l_partkey
 and p_brand = 'Brand#54'
 and p_container = 'JUMBO CASE'
 and l_quantity < (
 0.2 * avg(l_quantity)
 l_partkey = p_partkey

Another one is:
 with q17_part as (
 select p_partkey from part where 
 p_brand = 'Brand#23'
 and p_container = 'MED BOX'
 q17_avg as (
 select l_partkey as t_partkey, 0.2 * avg(l_quantity) as t_avg_quantity
 from lineitem 
 where l_partkey IN (select p_partkey from q17_part)
 group by l_partkey
 q17_price as (
 l_partkey IN (select p_partkey from q17_part)
 select cast(sum(l_extendedprice) / 7.0 as decimal(32,2)) as avg_yearly
 from q17_avg, q17_price
 t_partkey = l_partkey and l_quantity < t_avg_quantity;

The query plan of the later SQL is:

Since Hash join's left tree is parquet scan(no motion between slices), so the Bloom filter
can be used. It has improved a lot for this SQL. The result shows improved from 16 seconds
to 8.3 seconds(Q17* in the first graph). The speedup is about 1.93X.

If we choose more simple Hash Join query, not TPCH queries, like: select count * from part,
lineitem where p_partkey = l_partkey and p_brand = 'Brand#23' and p_container = 'MED BOX';
with bloomfilter enable, the speedup can be more than 2X.

> Implement Runtime Filter for Hash Join
> --------------------------------------
>                 Key: HAWQ-1597
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1597
>             Project: Apache HAWQ
>          Issue Type: New Feature
>          Components: Query Execution
>            Reporter: Lin Wen
>            Assignee: Lin Wen
>            Priority: Major
>             Fix For:
>         Attachments: 111BA854-7318-46A7-8338-5F2993D60FA3.png, HAWQ Runtime Filter Design.pdf,
HAWQ Runtime Filter Design.pdf, q17_modified_hawq.gif
> Bloom filter is a space-efficient probabilistic data structure invented in 1970, which
is used to test whether an element is a member of a set.
> Nowdays, bloom filter is widely used in OLAP or data-intensive applications to quickly
filter data. It is usually implemented in OLAP systems for hash join. The basic idea is,
when hash join two tables, during the build phase, build a bloomfilter information for the
inner table, then push down this bloomfilter information to the scan of the outer table, so
that, less tuples from the outer table will be returned to hash join node and joined with
hash table. It can greatly improment the hash join performance if the selectivity is high.

This message was sent by Atlassian JIRA

View raw message