flink-issues mailing list archives

From "godfrey he (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-5859) support partition pruning on Table API & SQL
Date Mon, 27 Feb 2017 03:54:45 GMT

    [ https://issues.apache.org/jira/browse/FLINK-5859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15885098#comment-15885098

godfrey he commented on FLINK-5859:

Hi [~fhueske], thanks for your advice.

IMO, rules such as `PushProjectIntoBatchTableSourceScanRule`, `PushFilterIntoBatchTableSourceScanRule`, and
`PartitionPruningRule` (which we could perhaps integrate into `PushFilterIntoBatchTableSourceScanRule`)
need to be applied only once and do not actually need a cost model. Rules such as
`FilterCalcMergeRule`, `FilterJoinRule`, `DataSetCalcRule`, and so on
do not need real cost; a dummy cost is enough. Rules such as `LoptOptimizeJoinRule` and `JoinToMultiJoinRule`
are applied with real cost. So we want to break the optimization phase down into
three phases later. The whole optimization would include five steps:
1. decorrelate the query
2. normalize the logical plan with the HEP planner
3. optimize the logical plan with the Volcano planner and dummy cost (including `FilterCalcMergeRule`,
`FilterJoinRule`, `DataSetCalcRule`, and so on)
4. optimize the physical plan with the HEP planner (including `PushProjectIntoBatchTableSourceScanRule`,
`PushFilterIntoBatchTableSourceScanRule`, and so on)
5. optimize the physical plan with the Volcano planner and real cost (including `LoptOptimizeJoinRule`,
`JoinToMultiJoinRule`, and so on)
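To make the phasing concrete, here is a minimal conceptual sketch (Python, purely illustrative; every function here is a hypothetical stand-in, not Calcite's `HepPlanner`/`VolcanoPlanner` API, which operates on `RelNode` trees in Java). It contrasts a fire-once, cost-free HEP-style pass with a cost-guarded Volcano-style pass:

```python
# Conceptual sketch of the five-step pipeline above. All names are
# hypothetical stand-ins; plans are just lists of operator names.

def hep_phase(plan, rules):
    """Apply each rule once, in order, like a HEP pass: no cost model."""
    for rule in rules:
        plan = rule(plan)
    return plan

def volcano_phase(plan, rules, cost_fn):
    """Greedy stand-in for a cost-based pass: keep a rewrite only if it
    does not increase cost. With a constant ("dummy") cost, every rewrite
    is accepted, matching the 'dummy cost is enough' case in step 3."""
    for rule in rules:
        candidate = rule(plan)
        if cost_fn(candidate) <= cost_fn(plan):
            plan = candidate
    return plan

def filter_calc_merge(plan):
    # Stand-in for FilterCalcMergeRule: merge Filter into the Calc below it.
    if plan[:2] == ["Filter", "Calc"]:
        return ["Calc(filtered)"] + plan[2:]
    return plan

def push_filter_into_scan(plan):
    # Stand-in for PushFilterIntoBatchTableSourceScanRule: push the
    # filtering Calc into the table scan; needs to fire only once.
    if plan[-2:] == ["Calc(filtered)", "Scan"]:
        return plan[:-2] + ["Scan(filtered)"]
    return plan

plan = ["Filter", "Calc", "Scan"]
plan = hep_phase(plan, [])                                    # 2. normalize (no-op here)
plan = volcano_phase(plan, [filter_calc_merge], lambda p: 0)  # 3. dummy cost
plan = hep_phase(plan, [push_filter_into_scan])               # 4. fire-once push-down
plan = volcano_phase(plan, [], len)                           # 5. real cost (no rules here)
print(plan)  # ['Scan(filtered)']
```

The point of the split is visible in the signatures: the fire-once phases never consult a cost function at all, so keeping them out of the Volcano search space shrinks what the cost-based phase has to explore.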

With this structure, each optimization phase keeps its complexity as small as possible,
and your concern is addressed as well.

Looking forward to your advice, thanks.

> support partition pruning on Table API & SQL
> --------------------------------------------
>                 Key: FLINK-5859
>                 URL: https://issues.apache.org/jira/browse/FLINK-5859
>             Project: Flink
>          Issue Type: New Feature
>          Components: Table API & SQL
>            Reporter: godfrey he
>            Assignee: godfrey he
> Many data sources are partitionable stores, e.g. HDFS and Druid, and many queries only
> need to read a small subset of the total data. We can use partition information to prune
> or skip over files irrelevant to the user's query. Both query optimization time and
> execution time can be reduced significantly, especially for large partitioned tables.
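The pruning idea in the issue description can be sketched in a few lines (illustrative only; the partition layout and predicate below are made up, and Flink's actual implementation works on `TableSource` partitions and Calcite filter expressions):

```python
# Minimal sketch of partition pruning: a partitioned table is modeled as
# {partition_value: files}, and only partitions whose value satisfies the
# query predicate contribute files to the scan.

partitions = {
    "2017-02-25": ["/data/dt=2017-02-25/part-0"],
    "2017-02-26": ["/data/dt=2017-02-26/part-0"],
    "2017-02-27": ["/data/dt=2017-02-27/part-0"],
}

def prune(partitions, predicate):
    """Keep only files of partitions passing the predicate, so files in
    irrelevant partitions are never opened at execution time."""
    return [f for value, files in partitions.items()
            if predicate(value) for f in files]

# Query: WHERE dt >= '2017-02-26' -- two of the three partitions survive.
files = prune(partitions, lambda dt: dt >= "2017-02-26")
print(files)
```

Because the predicate is evaluated against partition metadata rather than row data, the pruning happens at planning time, which is where both the optimization-time and execution-time savings come from.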

This message was sent by Atlassian JIRA
