drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From weijie tong <tongweijie...@gmail.com>
Subject Propose about join push down
Date Tue, 19 Sep 2017 15:45:36 GMT
   This is a propose about join query tuning by pushing down the join
condition. Welcome suggestion ,discussion,objection .

   Suppose we have a join query "select t1.a,t1.s,t3.d (select a, sum(b) as
s from t1 where a='1' group by a ) t2 join t3 on t2.a = t3.a"  .  This
query will be transferred to a hashjoin or boradcast hashjoin (if metadata
is accurate ). But the t3's rows will all be pulled out from the storage.
If the t3 is a large table,the performance will be unacceptable.
If we can first get the 'a' result set of the inner query,then we pushed
down the result set to the right table t3's scan node. The right table's
scan will be quickly.

     possible solutions :
     1. A new physical operator or  broadcast join ,hash join enhancements
, which need to first query the left table's data, then push down the
filtered left join condition column set to the right table stream, once
confirmed the pushed down , works as normal join query logic.
     2. The pushed down join condition set maybe two possible formats bloom
filters bytes  or list of strings.
     3. RecordBatch needs to support to push down 2's data down stream.
     4. SubScan needs to hold the 2's data,and wait for next real call to
push down to the storage level query.
     5. Storage level should have an interface to indicate whether it
supports to solve the pushed down bloom filter or list of strings.

     Since this violates drill's data flow direction,it seems a lot of work
to do ,to change to implement this feature.

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message