flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-5394) the estimateRowCount method of DataSetCalc didn't work
Date Fri, 13 Jan 2017 10:14:26 GMT

    [ https://issues.apache.org/jira/browse/FLINK-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821550#comment-15821550 ]

ASF GitHub Bot commented on FLINK-5394:
---------------------------------------

Github user fhueske commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3058#discussion_r95968750
  
    --- Diff: flink-libraries/flink-table/src/main/scala/org/apache/flink/table/plan/nodes/dataset/DataSetSort.scala ---
    @@ -71,6 +72,21 @@ class DataSetSort(
         )
       }
     
    +  override def estimateRowCount(metadata: RelMetadataQuery): Double = {
    +    val inputRowCnt = metadata.getRowCount(this.getInput)
    +    if (inputRowCnt == null) {
    +      inputRowCnt
    +    } else {
    +      val rowCount = Math.max(inputRowCnt - limitStart, 0D)
    --- End diff --
    
    inputRowCnt might itself be only an estimate and is not guaranteed to be precise. Returning 1 instead of 0 as the lower bound is more robust, because it avoids zero-cost operators downstream.
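
    The suggestion above can be sketched in plain Java. This is only an illustration, not the code from the pull request: the class SortRowCountSketch and its method signature are hypothetical, and the real method takes a RelMetadataQuery rather than a pre-computed input estimate.

```java
// Hypothetical helper for illustration only; not Flink's DataSetSort code.
final class SortRowCountSketch {
    private SortRowCountSketch() {}

    /**
     * Estimates the output row count of a sort that skips limitStart rows.
     *
     * @param inputRowCnt estimated number of input rows, or null if unknown
     * @param limitStart  number of rows skipped by the offset
     * @return the estimate, never below 1, or null if the input is unknown
     */
    static Double estimateRowCount(Double inputRowCnt, long limitStart) {
        if (inputRowCnt == null) {
            // No estimate is available, so "unknown" is passed on unchanged.
            return null;
        }
        // The input count is itself an estimate, so the difference may be
        // too low. Clamping to 1 instead of 0 keeps downstream operators
        // from being costed as free.
        return Math.max(inputRowCnt - limitStart, 1.0);
    }
}
```

    With an estimated input of 5 rows and an offset of 10, this sketch returns 1.0 rather than 0.0.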


> the estimateRowCount method of DataSetCalc didn't work
> ------------------------------------------------------
>
>                 Key: FLINK-5394
>                 URL: https://issues.apache.org/jira/browse/FLINK-5394
>             Project: Flink
>          Issue Type: Bug
>          Components: Table API & SQL
>            Reporter: zhangjing
>            Assignee: zhangjing
>
> The estimateRowCount method of DataSetCalc currently does not work.
> If I run the following code,
> {code}
> Table table = tableEnv
>   .fromDataSet(data, "a, b, c")
>   .groupBy("a")
>   .select("a, a.avg, b.sum, c.count")
>   .where("a == 1");
> {code}
> the cost of every node in the optimized node tree is:
> {code}
> DataSetAggregate(groupBy=[a], select=[a, AVG(a) AS TMP_0, SUM(b) AS TMP_1, COUNT(c) AS TMP_2]): rowcount = 1000.0, cumulative cost = {3000.0 rows, 5000.0 cpu, 28000.0 io}
>   DataSetCalc(select=[a, b, c], where=[=(a, 1)]): rowcount = 1000.0, cumulative cost = {2000.0 rows, 2000.0 cpu, 0.0 io}
>       DataSetScan(table=[[_DataSetTable_0]]): rowcount = 1000.0, cumulative cost = {1000.0 rows, 1000.0 cpu, 0.0 io}
> {code}
> We expect the input rowcount of DataSetAggregate to be less than 1000; however, the actual input rowcount is still 1000 because the estimateRowCount method of DataSetCalc does not take effect.
> There are two reasons for this:
> 1. No custom metadataProvider is provided yet, so when DataSetAggregate calls RelMetadataQuery.getRowCount(DataSetCalc) to estimate its input rowcount, the call is dispatched to RelMdRowCount.
> 2. DataSetCalc is a subclass of SingleRel, so that call matches getRowCount(SingleRel rel, RelMetadataQuery mq), which never uses DataSetCalc.estimateRowCount.
> The same problem affects all Flink RelNodes that are subclasses of SingleRel.
> I plan to resolve this problem by adding a FlinkRelMdRowCount which contains specific getRowCount methods for Flink RelNodes.
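
The dispatch problem described in the issue can be modelled with a small toy example in plain Java. This is a sketch under stated assumptions, not Calcite or Flink code: the classes Node, SingleInputNode and FilterNode, the two handler methods, and the 0.5 selectivity are all hypothetical stand-ins for RelNode, SingleRel, DataSetCalc, RelMdRowCount and the proposed FlinkRelMdRowCount.

```java
// Toy model for illustration only; every name here is hypothetical.
final class RowCountDispatchSketch {
    private RowCountDispatchSketch() {}

    /** Stand-in for a generic relational node, such as a table scan. */
    static class Node {
        double estimateRowCount() {
            return 1000.0;
        }
    }

    /** Stand-in for a node with exactly one input (the role of SingleRel). */
    static class SingleInputNode extends Node {
        final Node input;

        SingleInputNode(Node input) {
            this.input = input;
        }
    }

    /** Stand-in for a filtering node that overrides the estimate. */
    static class FilterNode extends SingleInputNode {
        FilterNode(Node input) {
            super(input);
        }

        @Override
        double estimateRowCount() {
            // Assumed selectivity of 0.5, chosen only for the example.
            return input.estimateRowCount() * 0.5;
        }
    }

    /** Default handler: the single-input case shadows the node's override. */
    static double defaultRowCount(Node node) {
        if (node instanceof SingleInputNode) {
            // Forwards to the input and never calls node.estimateRowCount().
            return defaultRowCount(((SingleInputNode) node).input);
        }
        return node.estimateRowCount();
    }

    /** Custom handler: a more specific case for FilterNode is tried first. */
    static double customRowCount(Node node) {
        if (node instanceof FilterNode) {
            return node.estimateRowCount();
        }
        return defaultRowCount(node);
    }
}
```

In this toy model, defaultRowCount on a FilterNode over a scan yields the scan's 1000.0, which mirrors the unchanged rowcount in the plan above, while customRowCount yields 500.0 because the node's own estimate is used.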



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)