drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jinfeng Ni (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-2400) Drill's query planner did not merge multiple filters into one.
Date Fri, 06 Mar 2015 21:38:38 GMT
Jinfeng Ni created DRILL-2400:
---------------------------------

             Summary: Drill's query planner did not merge multiple filters into one. 
                 Key: DRILL-2400
                 URL: https://issues.apache.org/jira/browse/DRILL-2400
             Project: Apache Drill
          Issue Type: Bug
          Components: Query Planning & Optimization
            Reporter: Jinfeng Ni
            Assignee: Jinfeng Ni


Let's say we have a view defined with a WHERE clause as the following:

{code}
"create or replace view MyViewWithFilter as " +
        " SELECT  first_name, " +
        "         last_name, " +
        "         full_name, " +
        "         salary, " +
        "         employee_id, " +
        "         store_id, " +
        "         position_id, " +
        "         position_title, " +
        "         education_level " +
        " FROM cp.`employee.json` " +
        " WHERE position_id in (1, 2, 3 ) "
{code}

For the following query which refers to the above view in two places, one in out query, one
in the subquery, each reference adds an additional filter,

{code}
select dat.store_id
      , sum(dat.store_cost) as total_cost
 from (
       select store_id,  
                  position_id , 
                  sum( salary) as store_cost
       from MyViewWithFilter 
       where full_name in ( select n_name
                                       from cp.`tpch/nation.parquet`)
           and  education_level = 'GRADUATE DEGREE'   
           and position_id in ( select position_id 
                                          from MyViewWithFilter
                                          where position_title like '%VP%'
                                        )
  group by store_id, position_id
) dat
group by dat.store_id
order by dat.store_id
{code}

The current plan will be like the following :
{code}
00-00    Screen
00-01      Project(store_id=[$0], total_cost=[$1])
00-02        SelectionVectorRemover
00-03          Sort(sort0=[$0], dir0=[ASC])
00-04            HashAgg(group=[{0}], total_cost=[SUM($1)])
00-05              Project(store_id=[$0], store_cost=[$2])
00-06                HashAgg(group=[{0, 1}], store_cost=[SUM($2)])
00-07                  Project(store_id=[$1], position_id=[$2], $f3=[$0])
00-08                    HashJoin(condition=[=($4, $5)], joinType=[inner])
00-10                      Project($f3=[$0], $f5=[$1], $f6=[$2], $f8=[$3], $f11=[$2])
00-12                        HashJoin(condition=[=($4, $5)], joinType=[inner])
00-15                          SelectionVectorRemover
00-18                            Filter(condition=[=($3, 'GRADUATE DEGREE')])
00-20                              Project($f3=[CAST($2):ANY NOT NULL], $f5=[CAST($3):ANY
NOT NULL], $f6=[CAST($0):ANY NOT NULL], $f8=[CAST($4):ANY NOT NULL], $f9=[CAST($1):ANY NOT
NULL])
00-22                                SelectionVectorRemover
00-23                                  Filter(condition=[OR(=($0, 1), =($0, 2), =($0, 3))])
00-24                                    Project(position_id=[$0], full_name=[$4], salary=[$1],
store_id=[$2], education_level=[$3])
00-25                                      Scan(groupscan=[EasyGroupScan [selectionRoot=/employee.json,
numFiles=1, columns=[`position_id`, `full_name`, `salary`, `store_id`, `education_level`],
files=[/employee.json]]])
00-14                          HashAgg(group=[{0}])
00-17                            Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath
[path=/tpch/nation.parquet]], selectionRoot=/tpch/nation.parquet, numFiles=1, columns=[`n_name`]]])
00-09                      HashAgg(group=[{0}])
00-11                        Project(position_id=[CAST($0):ANY NOT NULL])
00-13                          SelectionVectorRemover
00-16                            Filter(condition=[LIKE(CAST($1):ANY NOT NULL, '%VP%')])
00-19                              Filter(condition=[OR(=($0, 1), =($0, 2), =($0, 3))])
00-21                                Scan(groupscan=[EasyGroupScan [selectionRoot=/employee.json,
numFiles=1, columns=[`position_id`, `position_title`], files=[/employee.json]]])
{code}

In the plan,  the View's own filter and the additional filter were not merged, and we could
see either in the out query or subquery, there are two filters, like :
{code}
00-16                            Filter(condition=[LIKE(CAST($1):ANY NOT NULL, '%VP%')])
00-19                              Filter(condition=[OR(=($0, 1), =($0, 2), =($0, 3))])
{code}

It is desired to have filter merged, because 
1) it could improve the possibility of partition pruning with the merged filter, since the
partition pruning will match a filter directly on top of SCAN.
   Filter   (dir0 = ' ....') 
      \     
   Filter   ( colA = '...')
          \
       SCAN
If the filter with (dir0) is not merged with the second filter, then partition pruning will
not kick in

2) Merged filter would allow short circuit evaluation, which could mean better performance,
compared with the case where we evaluate the filter one by one.

There are two reasons why filters are not merged:
1) Drill's planner does not enable the rule of MergerFilter rule
2) Even if we enable MergeFilter rule, in some situation, we still could not get filter merged,
because the cpu cost estimation for filter is a bit off the real cost.

In this JIRA, we'll 

1) Enable MergeFilter (Drill's version)
2) Fix the cost estimation formula for Filter operator.

 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message