drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chunhui Shi <c...@maprtech.com>
Subject Re: About the IN condition push down.
Date Thu, 17 Nov 2016 19:15:14 GMT
I feel that we should allow preserving IN operator and not transforming it
to OR or HashJoin, and seems this has to be done in Calcite. So we could
allow pushing IN to scan/reader to utilize progressing data formats and
index designs, e.g. min/max, a variety of bloom filters.

On Wed, Nov 16, 2016 at 12:59 AM, WeiWan <weiwan@sunteng.com> wrote:

> Hi Nicolas,
>
> I know the option “planner.in_subquery_threshold”. But as a Drill plugin,
> we don’t want to force people to set this option. Besides it could hurt the
> performance if we set it to a very large number.
>
> I’m considering using a rule like this to get the in-lists:
>
>   public static StoragePluginOptimizerRule JoinFromIn = new
> StoragePluginOptimizerRule(
>       RelOptHelper.some(HashJoinPrel.class,
>           RelOptHelper.some(ProjectPrel.class,
>               RelOptHelper.any(ScanPrel.class)),
>           RelOptHelper.some(BroadcastExchangePrel.class,
>               RelOptHelper.some(HashAggPrel.class,
>                   RelOptHelper.any(ValuesPrel.class))))
>       , "JoinFromIn") {
>
>     // ....
>
>   }
>
> from the physical plan :
>
> | 00-00    Screen
> 00-01      UnionExchange
> 01-01        Project(user_id=[$0])
> 01-02          HashJoin(condition=[=($1, $2)], joinType=[inner])
> 01-04            Project($f15=[$0], $f69=[$0])
> 01-05              Scan(groupscan=[TestGroupScan@
> 4013af3d{Spec=TestScanSpec@73ccb9dc{table:test, rsFilter:null},
> columns=[`user_id`]}])
> 01-03            BroadcastExchange
> 02-01              HashAgg(group=[{0}])
> 02-02                Values
>
>
> When I found a HashJoin with a Scan and a HashAgg with Values, I can
> assume that it is transform from an in-lists.
>
> But I found that it doesn’t match any RelNode.  Any suggestions for my
> rule?
>
>
> Regards
> Flow Wei
>
>
>
> > On Nov 16, 2016, at 15:31, Nicolas Paris <niparisco@gmail.com> wrote:
> >
> > 2016-11-16 8:17 GMT+01:00 WeiWan <weiwan@sunteng.com <mailto:
> weiwan@sunteng.com>>:
> >
> >> Hi Julian,
> >>
> >> What I mean by “notified about the IN condition" here is My
> TestGroupScan
> >> should knows the IN condition when people writes SQLs like “WHERE a IN
> >> (…)”. It is difficult for me because Drill’s physical planner has
> transform
> >> it into a HashJoin.
> >>
> >> For example if the values of IN condition lest than 20, the physical
> plan
> >> looks like this:
> >>
> >> 0: jdbc:drill:drillbit=localhost> explain plan for select user_id from
> >> test where user_id in (1,2,3,4,5,6);
> >> +------+------+
> >> | text | json |
> >> +------+------+
> >> | 00-00    Screen
> >> 00-01      UnionExchange
> >> 01-01        Project(user_id=[$0])
> >> 01-02          SelectionVectorRemover
> >> 01-03            Filter(condition=[OR(=($0, 1), =($0, 2), =($0, 3),
> =($0,
> >> 4), =($0, 5), =($0, 6))])
> >> 01-04              Scan(groupscan=[TestGroupScan@
> >> 58253efd{Spec=TestScanSpec@893899c{table:test, rsFilter:In($user_id:
> >> (6)[1,2,3,4,5 ...])}, columns=[`user_id`]}])
> >>
> >> And I can use a rule like this to fetch the filter:
> >>
> >>  public static StoragePluginOptimizerRule TestRule = new
> >> StoragePluginOptimizerRule(
> >>      RelOptHelper.some(FilterPrel.class,
> >>          RelOptHelper.any(ScanPrel.class)), "TestRule") {
> >>    @Override
> >>    public void onMatch(RelOptRuleCall call) {
> >>      FilterPrel filter = (FilterPrel) call.rel(0);
> >>      ScanPrel scan = (ScanPrel) call.rel(1);
> >>      RexNode condition = filter.getCondition();
> >>
> >>      // Then we can do the scan optimization with “filter”, which
> >> including IN condition.
> >>    }
> >>  };
> >>
> >> But when the values are over 20, the physical plan becomes like this:
> >>
> >
> > ​There is this drill config parameter (drill>= 1.8)
> in_subquery_threshold .
> > Default value is 20. Just set it to a very high value (its a long field)​
> >
> >
> >
> >> 0: jdbc:drill:drillbit=localhost> explain plan for select user_id from
> >> test where user_id in (1,2,3,4,5,6,7,8,9,10,11,12,
> >> 13,14,15,16,17,18,19,20);
> >> +------+------+
> >> | text | json |
> >> +------+------+
> >> | 00-00    Screen
> >> 00-01      UnionExchange
> >> 01-01        Project(user_id=[$0])
> >> 01-02          HashJoin(condition=[=($1, $2)], joinType=[inner])
> >> 01-04            Project($f15=[$0], $f69=[$0])
> >> 01-05              Scan(groupscan=[TestGroupScan@
> >> 132c4830{Spec=TestScanSpec@416812cf{table:test, rsFilter:null},
> >> columns=[`user_id`]}])
> >> 01-03            BroadcastExchange
> >> 02-01              HashAgg(group=[{0}])
> >> 02-02                Values
> >>
> >> Now I need to find another way to know and get the IN condition,
> including
> >> the field name, and the values of IN clause. How can I do this?
> >>
> >>
> >> Regards
> >> Flow Wei
> >>
> >>
> >>
> >>> On Nov 16, 2016, at 13:32, Julian Hyde <jhyde@apache.org> wrote:
> >>>
> >>> I’m not sure what you mean by “notified about the IN condition”. You
> >> either have to convert it to a semi-join or not.
> >>>
> >>> I don’t know how expensive hash-joins are in Drill. If they’re
> >> expensive, you could hand-write a UDF that builds a java hash-map, and
> see
> >> whether it performs better.
> >>>
> >>> Julian
> >>>
> >>>
> >>>> On Nov 15, 2016, at 7:38 PM, WeiWan <weiwan@sunteng.com> wrote:
> >>>>
> >>>> Hi!
> >>>>
> >>>> I’m working on a new data format using Apache Drill as query engine.
> It
> >> has very nice scan speed, high compression ratio, and contains indexes
> to
> >> filter out irrelevant parts. Even more interesting is it supports
> extremely
> >> fast messages realtime ingestion. It is not yet open sourced but soon.
> >>>>
> >>>> But right now we run into a issue and need your help. We know Drill
> can
> >> push down predicates to GroupScan. But it may transform the large IN
> >> condition into an HashJoin to improve performance. Like this:
> >>>>
> >>>> 0: jdbc:drill:drillbit=localhost> explain plan for select user_id
from
> >> test where user_id in (1,2,3,4,5,6,7,8,9,10,11,12,
> >> 13,14,15,16,17,18,19,20);
> >>>> +------+------+
> >>>> | text | json |
> >>>> +------+------+
> >>>> | 00-00    Screen
> >>>> 00-01      UnionExchange
> >>>> 01-01        Project(user_id=[$0])
> >>>> 01-02          HashJoin(condition=[=($1, $2)], joinType=[inner])
> >>>> 01-04            Project($f15=[$0], $f69=[$0])
> >>>> 01-05              Scan(groupscan=[TestGroupScan@
> >> 4013af3d{Spec=TestScanSpec@73ccb9dc{table:test, rsFilter:null},
> >> columns=[`user_id`]}])
> >>>> 01-03            BroadcastExchange
> >>>> 02-01              HashAgg(group=[{0}])
> >>>> 02-02                Values
> >>>>
> >>>>
> >>>> The problem is we need to know there is a IN condition, either by
> >> predicates push down or other ways, so that we can do the optimization
> job.
> >>>>
> >>>> Drill has an option “planner.in_subquery_threshold” can prevent
the
> >> planner from transforming the IN condition, but we would like to keep
> that
> >> untouched because a large  “or (equals … )” can also slow down things.
> >>>>
> >>>> So our question is, if we keep the “in_subquery” planner optimization,
> >> is there any ways we can be notified about the IN condition?
> >>>>
> >>>> Looking forward to your reply!
> >>>>
> >>>> Regards
> >>>> Flow Wei
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message