drill-issues mailing list archives

From "Jinfeng Ni (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-3765) Partition prune rule is unnecessary fired multiple times.
Date Thu, 12 Nov 2015 00:26:11 GMT

    [ https://issues.apache.org/jira/browse/DRILL-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001400#comment-15001400 ]

Jinfeng Ni commented on DRILL-3765:
-----------------------------------

I did some preliminary testing to see how much planning performance we may gain from the patch
if we move the PruneScanRules into a HepPlanner that runs after the project/filter pushdown has
been applied. Here are the results from a run on a Mac.

Data: TPC-DS sample dataset:
1. Create a partitioned table.  This produces a table with 18000 parquet files. 
{code}
create table dfs.tmp.store_pb_item_sk partition by (ss_item_sk) as select * from store_sale;
{code} 

2. Query the partitioned table with a filter that refers to both the partition column (ss_item_sk)
and a non-partitioning column (ss_customer_sk).
{code}
explain plan for select ss_sold_date_sk, ss_sold_time_sk, ss_item_sk, ss_customer_sk from
dfs.tmp.store_pb_item_sk where ss_item_sk in (100, 200, 300, 400, 500) and ss_customer_sk
= 96479;
{code}  

3. Results:
{code}
alter session set `planner.enable_hep_opt` = true;

explain plan for select ss_sold_date_sk, ss_sold_time_sk, ss_item_sk, ss_customer_sk from
dfs.tmp.store_pb_item_sk where ss_item_sk in (100, 200, 300, 400, 500) and ss_customer_sk
= 96479;

1 row selected (5.246 seconds)

alter session set `planner.enable_hep_opt` = false;
explain plan for select ss_sold_date_sk, ss_sold_time_sk, ss_item_sk, ss_customer_sk from
dfs.tmp.store_pb_item_sk where ss_item_sk in (100, 200, 300, 400, 500) and ss_customer_sk
= 96479;

+------+------+
1 row selected (9.412 seconds)
{code}

By avoiding the repeated PruneScanRule executions, the planning time is reduced from 9.4 seconds
to 5.2 seconds. With more parquet files in the table, or with a query that joins multiple tables,
we would expect to see even bigger improvements with this patch.

With a parquet metadata cache file created, I saw similar planning times for the existing code
and the patched code.
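
The comment does not say how the metadata cache file was produced. A likely step, assuming
the standard Drill command was run against the table created in step 1, would be:
{code}
-- Assumption: the cache was built with Drill's REFRESH TABLE METADATA command,
-- pointed at the partitioned table from step 1.
refresh table metadata dfs.tmp.store_pb_item_sk;
{code}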

The log shows that the existing code does indeed fire the PruneScanRules multiple times, for both
directory-based pruning and partitioning-column (from CTAS) based pruning. With the patch,
partition pruning is fired once for directory-based pruning and once for partitioning-column
pruning. That explains the performance gain we saw in this preliminary test.




> Partition prune rule is unnecessary fired multiple times. 
> ----------------------------------------------------------
>
>                 Key: DRILL-3765
>                 URL: https://issues.apache.org/jira/browse/DRILL-3765
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>
> It seems that the partition prune rule may be fired multiple times, even after the first
rule execution has pushed the filter into the scan operator. Since partition pruning has to
build vectors to hold the partition / file / directory information, invoking the partition
prune rule unnecessarily may lead to significant memory overhead.
> The Drill planner should avoid unnecessary executions of the partition prune rule, in order to
reduce the chance of hitting an OOM exception while the rule is executing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
