flink-issues mailing list archives

From "zhangjing (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (FLINK-5394) the estimateRowCount method of DataSetCalc didn't work
Date Tue, 27 Dec 2016 11:10:58 GMT

     [ https://issues.apache.org/jira/browse/FLINK-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhangjing updated FLINK-5394:
-----------------------------
    Description: 
The estimateRowCount method of DataSetCalc currently has no effect.
If I run the following code,
`
Table table = tableEnv
    .fromDataSet(data, "a, b, c")
    .groupBy("a")
    .select("a, a.avg, b.sum, c.count")
    .where("a == 1");
`
the cost of every node in the optimized node tree is:
`
DataSetAggregate(groupBy=[a], select=[a, AVG(a) AS TMP_0, SUM(b) AS TMP_1, COUNT(c) AS TMP_2]): rowcount = 1000.0, cumulative cost = {3000.0 rows, 5000.0 cpu, 28000.0 io}
  DataSetCalc(select=[a, b, c], where=[=(a, 1)]): rowcount = 1000.0, cumulative cost = {2000.0 rows, 2000.0 cpu, 0.0 io}
    DataSetScan(table=[[_DataSetTable_0]]): rowcount = 1000.0, cumulative cost = {1000.0 rows, 1000.0 cpu, 0.0 io}
`
We expect the input row count of DataSetAggregate to be less than 1000, because DataSetCalc
applies the filter a = 1 first. However, the actual input row count is still 1000, because the
estimateRowCount method of DataSetCalc is never used.


There are two reasons for this:
1. When DataSetAggregate calls RelMetadataQuery.getRowCount(DataSetCalc) to estimate its input
row count, the call is dispatched to RelMdRowCount.
2. DataSetCalc is a subclass of SingleRel, so that call matches getRowCount(SingleRel
rel, RelMetadataQuery mq), which never uses DataSetCalc.estimateRowCount.
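
The two points above can be illustrated with a minimal, self-contained sketch. It does not use Calcite's or Flink's real classes: Rel, SingleRel, FilteringCalc, HANDLERS and the selectivity of 0.5 are hypothetical stand-ins, and the sketch assumes that the generic SingleRel handler simply forwards the row count of its input.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToDoubleFunction;

/** Simplified model of class-based row-count dispatch (hypothetical, not Calcite's API). */
final class RowCountDispatchDemo {

    /** Stand-in for RelNode. */
    static class Rel {
        final Rel input; // null for a leaf such as a scan

        Rel(Rel input) { this.input = input; }

        /** Node-local estimate; 1000.0 mirrors the scan row count in the plan above. */
        double estimateRowCount() { return 1000.0; }
    }

    /** Stand-in for SingleRel: a node with exactly one input. */
    static class SingleRel extends Rel {
        SingleRel(Rel input) { super(input); }
    }

    /** Stand-in for DataSetCalc with a filter; the 0.5 selectivity is made up. */
    static class FilteringCalc extends SingleRel {
        FilteringCalc(Rel input) { super(input); }

        @Override
        double estimateRowCount() { return input.estimateRowCount() * 0.5; }
    }

    /** Row-count handlers keyed by node class, standing in for RelMdRowCount. */
    static final Map<Class<?>, ToDoubleFunction<Rel>> HANDLERS = new HashMap<>();
    static {
        // Fallback handler: ask the node itself.
        HANDLERS.put(Rel.class, Rel::estimateRowCount);
        // Assumed generic handler for single-input nodes: forward the input's row count.
        HANDLERS.put(SingleRel.class, rel -> getRowCount(rel.input));
    }

    /** Picks the handler of the most specific registered class of the node. */
    static double getRowCount(Rel rel) {
        for (Class<?> c = rel.getClass(); c != null; c = c.getSuperclass()) {
            ToDoubleFunction<Rel> handler = HANDLERS.get(c);
            if (handler != null) {
                return handler.applyAsDouble(rel);
            }
        }
        throw new IllegalStateException("no row-count handler for " + rel.getClass());
    }

    public static void main(String[] args) {
        Rel calc = new FilteringCalc(new Rel(null));
        System.out.println(calc.estimateRowCount()); // the node's own estimate
        System.out.println(getRowCount(calc));       // what the dispatch returns
    }
}
```

In this model the SingleRel handler is the most specific match for FilteringCalc, so the overridden estimateRowCount is bypassed and the scan's row count is returned unchanged.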

I plan to resolve this problem by adding a FlinkRelMdRowCount that contains getRowCount
handlers specific to Flink RelNodes.
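
A rough sketch of the idea, again with hypothetical stand-in classes rather than the real Calcite or Flink API: a Flink-specific provider registers a handler for the calc node that is more specific than the generic SingleRel handler and delegates to the node's own estimate. The names RowCountProvider, FlinkRowCountProvider and FilteringCalc, and the selectivity of 0.5, are assumptions made for illustration only; the actual FlinkRelMdRowCount may look quite different.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToDoubleFunction;

/** Simplified model of the proposed fix (hypothetical, not Calcite's or Flink's API). */
final class FlinkRowCountSketch {

    /** Stand-in for RelNode. */
    static class Rel {
        final Rel input; // null for a leaf such as a scan

        Rel(Rel input) { this.input = input; }

        /** Node-local estimate; 1000.0 mirrors the scan row count in the plan above. */
        double estimateRowCount(RowCountProvider mq) { return 1000.0; }
    }

    /** Stand-in for SingleRel: a node with exactly one input. */
    static class SingleRel extends Rel {
        SingleRel(Rel input) { super(input); }
    }

    /** Stand-in for DataSetCalc with a filter; the 0.5 selectivity is made up. */
    static class FilteringCalc extends SingleRel {
        static final double SELECTIVITY = 0.5;

        FilteringCalc(Rel input) { super(input); }

        @Override
        double estimateRowCount(RowCountProvider mq) {
            return mq.getRowCount(input) * SELECTIVITY;
        }
    }

    /** Stand-in for the default row-count handlers. */
    static class RowCountProvider {
        final Map<Class<?>, ToDoubleFunction<Rel>> handlers = new HashMap<>();

        RowCountProvider() {
            // Fallback handler: ask the node itself.
            handlers.put(Rel.class, rel -> rel.estimateRowCount(this));
            // Assumed generic handler for single-input nodes: forward the input's row count.
            handlers.put(SingleRel.class, rel -> getRowCount(rel.input));
        }

        /** Picks the handler of the most specific registered class of the node. */
        double getRowCount(Rel rel) {
            for (Class<?> c = rel.getClass(); c != null; c = c.getSuperclass()) {
                ToDoubleFunction<Rel> handler = handlers.get(c);
                if (handler != null) {
                    return handler.applyAsDouble(rel);
                }
            }
            throw new IllegalStateException("no row-count handler for " + rel.getClass());
        }
    }

    /** Stand-in for the proposed FlinkRelMdRowCount. */
    static class FlinkRowCountProvider extends RowCountProvider {
        FlinkRowCountProvider() {
            // More specific than the SingleRel handler, so it wins the dispatch
            // and lets the calc node's own estimate take effect.
            handlers.put(FilteringCalc.class, rel -> rel.estimateRowCount(this));
        }
    }

    public static void main(String[] args) {
        Rel calc = new FilteringCalc(new Rel(null));
        System.out.println(new RowCountProvider().getRowCount(calc));      // generic handlers only
        System.out.println(new FlinkRowCountProvider().getRowCount(calc)); // with the calc handler
    }
}
```

With only the generic handlers the calc node reports its input's row count; with the additional handler it reports the filtered estimate, which is the behaviour the issue asks for.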

  was:
The estimateRowCount method of DataSetCalc currently has no effect.
If I run the following code,
`
Table table = tableEnv
    .fromDataSet(data, "a, b, c")
    .groupBy("a")
    .select("a, a.avg, b.sum, c.count")
    .where("a == 1");
`
the cost of every node in the optimized node tree is:
`
DataSetAggregate(groupBy=[a], select=[a, AVG(a) AS TMP_0, SUM(b) AS TMP_1, COUNT(c) AS TMP_2]): rowcount = 1000.0, cumulative cost = {3000.0 rows, 5000.0 cpu, 28000.0 io}
|_  DataSetCalc(select=[a, b, c], where=[=(a, 1)]): rowcount = 1000.0, cumulative cost = {2000.0 rows, 2000.0 cpu, 0.0 io}
     |_ DataSetScan(table=[[_DataSetTable_0]]): rowcount = 1000.0, cumulative cost = {1000.0 rows, 1000.0 cpu, 0.0 io}
`
We expect the input row count of DataSetAggregate to be less than 1000, because DataSetCalc
applies the filter a = 1 first. However, the actual input row count is still 1000, because the
estimateRowCount method of DataSetCalc is never used.


There are two reasons for this:
1. When DataSetAggregate calls RelMetadataQuery.getRowCount(DataSetCalc) to estimate its input
row count, the call is dispatched to RelMdRowCount.
2. DataSetCalc is a subclass of SingleRel, so that call matches getRowCount(SingleRel
rel, RelMetadataQuery mq), which never uses DataSetCalc.estimateRowCount.

I plan to resolve this problem by adding a FlinkRelMdRowCount that contains getRowCount
handlers specific to Flink RelNodes.


> the estimateRowCount method of DataSetCalc didn't work
> ------------------------------------------------------
>
>                 Key: FLINK-5394
>                 URL: https://issues.apache.org/jira/browse/FLINK-5394
>             Project: Flink
>          Issue Type: Bug
>          Components: Table API & SQL
>            Reporter: zhangjing
>            Assignee: zhangjing
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
