drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From WeiWan <wei...@sunteng.com>
Subject Re: About the IN condition push down.
Date Fri, 18 Nov 2016 02:06:52 GMT
That sounds perfect if we can push down all those where conditions into scan/reader. Really
like to see it in the next few releases 😁.


Regards
Flow Wei



> On Nov 18, 2016, at 03:15, Chunhui Shi <cshi@maprtech.com> wrote:
> 
> I feel that we should allow preserving IN operator and not transforming it
> to OR or HashJoin, and seems this has to be done in Calcite. So we could
> allow pushing IN to scan/reader to utilize progressing data formats and
> index designs, e.g. min/max, a variety of bloom filters.
> 
> On Wed, Nov 16, 2016 at 12:59 AM, WeiWan <weiwan@sunteng.com <mailto:weiwan@sunteng.com>>
wrote:
> 
>> Hi Nicolas,
>> 
>> I know the option “planner.in_subquery_threshold”. But as a Drill plugin,
>> we don’t want to force people to set this option. Besides it could hurt the
>> performance if we set it to a very large number.
>> 
>> I’m considering using a rule like this to get the in-lists:
>> 
>>  public static StoragePluginOptimizerRule JoinFromIn = new
>> StoragePluginOptimizerRule(
>>      RelOptHelper.some(HashJoinPrel.class,
>>          RelOptHelper.some(ProjectPrel.class,
>>              RelOptHelper.any(ScanPrel.class)),
>>          RelOptHelper.some(BroadcastExchangePrel.class,
>>              RelOptHelper.some(HashAggPrel.class,
>>                  RelOptHelper.any(ValuesPrel.class))))
>>      , "JoinFromIn") {
>> 
>>    // ....
>> 
>>  }
>> 
>> from the physical plan :
>> 
>> | 00-00    Screen
>> 00-01      UnionExchange
>> 01-01        Project(user_id=[$0])
>> 01-02          HashJoin(condition=[=($1, $2)], joinType=[inner])
>> 01-04            Project($f15=[$0], $f69=[$0])
>> 01-05              Scan(groupscan=[TestGroupScan@
>> 4013af3d{Spec=TestScanSpec@73ccb9dc{table:test, rsFilter:null},
>> columns=[`user_id`]}])
>> 01-03            BroadcastExchange
>> 02-01              HashAgg(group=[{0}])
>> 02-02                Values
>> 
>> 
>> When I found a HashJoin with a Scan and a HashAgg with Values, I can
>> assume that it is transform from an in-lists.
>> 
>> But I found that it doesn’t match any RelNode.  Any suggestions for my
>> rule?
>> 
>> 
>> Regards
>> Flow Wei
>> 
>> 
>> 
>>> On Nov 16, 2016, at 15:31, Nicolas Paris <niparisco@gmail.com> wrote:
>>> 
>>> 2016-11-16 8:17 GMT+01:00 WeiWan <weiwan@sunteng.com <mailto:weiwan@sunteng.com>
<mailto:
>> weiwan@sunteng.com>>:
>>> 
>>>> Hi Julian,
>>>> 
>>>> What I mean by “notified about the IN condition" here is My
>> TestGroupScan
>>>> should knows the IN condition when people writes SQLs like “WHERE a IN
>>>> (…)”. It is difficult for me because Drill’s physical planner has
>> transform
>>>> it into a HashJoin.
>>>> 
>>>> For example if the values of IN condition lest than 20, the physical
>> plan
>>>> looks like this:
>>>> 
>>>> 0: jdbc:drill:drillbit=localhost> explain plan for select user_id from
>>>> test where user_id in (1,2,3,4,5,6);
>>>> +------+------+
>>>> | text | json |
>>>> +------+------+
>>>> | 00-00    Screen
>>>> 00-01      UnionExchange
>>>> 01-01        Project(user_id=[$0])
>>>> 01-02          SelectionVectorRemover
>>>> 01-03            Filter(condition=[OR(=($0, 1), =($0, 2), =($0, 3),
>> =($0,
>>>> 4), =($0, 5), =($0, 6))])
>>>> 01-04              Scan(groupscan=[TestGroupScan@
>>>> 58253efd{Spec=TestScanSpec@893899c{table:test, rsFilter:In($user_id:
>>>> (6)[1,2,3,4,5 ...])}, columns=[`user_id`]}])
>>>> 
>>>> And I can use a rule like this to fetch the filter:
>>>> 
>>>> public static StoragePluginOptimizerRule TestRule = new
>>>> StoragePluginOptimizerRule(
>>>>     RelOptHelper.some(FilterPrel.class,
>>>>         RelOptHelper.any(ScanPrel.class)), "TestRule") {
>>>>   @Override
>>>>   public void onMatch(RelOptRuleCall call) {
>>>>     FilterPrel filter = (FilterPrel) call.rel(0);
>>>>     ScanPrel scan = (ScanPrel) call.rel(1);
>>>>     RexNode condition = filter.getCondition();
>>>> 
>>>>     // Then we can do the scan optimization with “filter”, which
>>>> including IN condition.
>>>>   }
>>>> };
>>>> 
>>>> But when the values are over 20, the physical plan becomes like this:
>>>> 
>>> 
>>> ​There is this drill config parameter (drill>= 1.8)
>> in_subquery_threshold .
>>> Default value is 20. Just set it to a very high value (its a long field)​
>>> 
>>> 
>>> 
>>>> 0: jdbc:drill:drillbit=localhost> explain plan for select user_id from
>>>> test where user_id in (1,2,3,4,5,6,7,8,9,10,11,12,
>>>> 13,14,15,16,17,18,19,20);
>>>> +------+------+
>>>> | text | json |
>>>> +------+------+
>>>> | 00-00    Screen
>>>> 00-01      UnionExchange
>>>> 01-01        Project(user_id=[$0])
>>>> 01-02          HashJoin(condition=[=($1, $2)], joinType=[inner])
>>>> 01-04            Project($f15=[$0], $f69=[$0])
>>>> 01-05              Scan(groupscan=[TestGroupScan@
>>>> 132c4830{Spec=TestScanSpec@416812cf{table:test, rsFilter:null},
>>>> columns=[`user_id`]}])
>>>> 01-03            BroadcastExchange
>>>> 02-01              HashAgg(group=[{0}])
>>>> 02-02                Values
>>>> 
>>>> Now I need to find another way to know and get the IN condition,
>> including
>>>> the field name, and the values of IN clause. How can I do this?
>>>> 
>>>> 
>>>> Regards
>>>> Flow Wei
>>>> 
>>>> 
>>>> 
>>>>> On Nov 16, 2016, at 13:32, Julian Hyde <jhyde@apache.org> wrote:
>>>>> 
>>>>> I’m not sure what you mean by “notified about the IN condition”.
You
>>>> either have to convert it to a semi-join or not.
>>>>> 
>>>>> I don’t know how expensive hash-joins are in Drill. If they’re
>>>> expensive, you could hand-write a UDF that builds a java hash-map, and
>> see
>>>> whether it performs better.
>>>>> 
>>>>> Julian
>>>>> 
>>>>> 
>>>>>> On Nov 15, 2016, at 7:38 PM, WeiWan <weiwan@sunteng.com> wrote:
>>>>>> 
>>>>>> Hi!
>>>>>> 
>>>>>> I’m working on a new data format using Apache Drill as query engine.
>> It
>>>> has very nice scan speed, high compression ratio, and contains indexes
>> to
>>>> filter out irrelevant parts. Even more interesting is it supports
>> extremely
>>>> fast messages realtime ingestion. It is not yet open sourced but soon.
>>>>>> 
>>>>>> But right now we run into a issue and need your help. We know Drill
>> can
>>>> push down predicates to GroupScan. But it may transform the large IN
>>>> condition into an HashJoin to improve performance. Like this:
>>>>>> 
>>>>>> 0: jdbc:drill:drillbit=localhost> explain plan for select user_id
from
>>>> test where user_id in (1,2,3,4,5,6,7,8,9,10,11,12,
>>>> 13,14,15,16,17,18,19,20);
>>>>>> +------+------+
>>>>>> | text | json |
>>>>>> +------+------+
>>>>>> | 00-00    Screen
>>>>>> 00-01      UnionExchange
>>>>>> 01-01        Project(user_id=[$0])
>>>>>> 01-02          HashJoin(condition=[=($1, $2)], joinType=[inner])
>>>>>> 01-04            Project($f15=[$0], $f69=[$0])
>>>>>> 01-05              Scan(groupscan=[TestGroupScan@
>>>> 4013af3d{Spec=TestScanSpec@73ccb9dc{table:test, rsFilter:null},
>>>> columns=[`user_id`]}])
>>>>>> 01-03            BroadcastExchange
>>>>>> 02-01              HashAgg(group=[{0}])
>>>>>> 02-02                Values
>>>>>> 
>>>>>> 
>>>>>> The problem is we need to know there is a IN condition, either by
>>>> predicates push down or other ways, so that we can do the optimization
>> job.
>>>>>> 
>>>>>> Drill has an option “planner.in_subquery_threshold” can prevent
the
>>>> planner from transforming the IN condition, but we would like to keep
>> that
>>>> untouched because a large  “or (equals … )” can also slow down things.
>>>>>> 
>>>>>> So our question is, if we keep the “in_subquery” planner optimization,
>>>> is there any ways we can be notified about the IN condition?
>>>>>> 
>>>>>> Looking forward to your reply!
>>>>>> 
>>>>>> Regards
>>>>>> Flow Wei


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message