drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jinfeng Ni <jinfengn...@gmail.com>
Subject Re: Hash Agg vs Streaming Agg for a smaller data set
Date Fri, 10 Jul 2015 23:48:45 GMT
I'm not clear which column is the partitioning column. From what you
described, row count of aggregator in the first case is larger than that in
the second case, since the former one requires full table scan. Cost-wise,
hash-agg would make more sense when the input is larger, since
streaming-agg requires sort, which could be expensive for large dataset.

My guess is the difference of rowcounts in the two cases cause the
difference in the query plan.

One suggestion. If you want to check query plan, it would make more sense
to try with reasonably large data.  Drill's costing model is not fully
calibrated yet;  a small dataset like tpch_0.0.1 might make it hard for the
cost model to pick the right plan. On the other hand, if the dataset is
small, two different plans normally would not make a big difference in
terms of performance. In other words, try to use large dataset if you are
interested in performance testing / plan verification.





On Fri, Jul 10, 2015 at 4:27 PM, rahul challapalli <
challapallirahul@gmail.com> wrote:

> Hi,
>
> Info about Data : The data is auto partitioned tpch 0.01 data. The second
> filter is a non-partitioned column, so in the first case the 'OR' predicate
> results in a full-table scan, while in the second case, partition pruning
> takes effect.
>
> The first case results in a hash agg and the second case in a streaming
> agg. Any idea why?
>
> 1. explain plan for select distinct l_modline, l_moddate from
> `tpch_multiple_partitions/lineitem_twopart` where l_moddate=date
> '1992-01-01' or l_shipdate=date'1992-01-01';
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      Project(l_modline=[$0], l_moddate=[$1])
> 00-02        Project(l_modline=[$0], l_moddate=[$1])
> 00-03          HashAgg(group=[{0, 1}])
> 00-04            Project(l_modline=[$2], l_moddate=[$0])
> 00-05              SelectionVectorRemover
> 00-06                Filter(condition=[OR(=($0, 1992-01-01), =($1,
> 1992-01-01))])
> 00-07                  Project(l_moddate=[$2], l_shipdate=[$1],
> l_modline=[$0])
> 00-08                    Scan..........
>
> 2. explain plan for select distinct l_modline, l_moddate from
> `tpch_multiple_partitions/lineitem_twopart` where l_moddate=date
> '1992-01-01' and l_shipdate=date'1992-01-01';
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      Project(l_modline=[$0], l_moddate=[$1])
> 00-02        Project(l_modline=[$0], l_moddate=[$1])
> 00-03          StreamAgg(group=[{0, 1}])
> 00-04            Sort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC])
> 00-05              Project(l_modline=[$2], l_moddate=[$0])
> 00-06                SelectionVectorRemover
> 00-07                  Filter(condition=[AND(=($0, 1992-01-01), =($1,
> 1992-01-01))])
> 00-08                    Project(l_moddate=[$2], l_shipdate=[$1],
> l_modline=[$0])
> 00-09                      Scan.....................
>
> - Rahul
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message