[ https://issues.apache.org/jira/browse/KYLIN3540?page=com.atlassian.jira.plugin.system.issuetabpanels:alltabpanel ]
Zhong Yanghong updated KYLIN-3540:

Description:
The current criterion for recommending mandatory cuboids is based on the average rollup count
collected from query metrics. This approach has a disadvantage, illustrated by the following example:
Cuboid (A,B) has 1000 rows and is prebuilt; Cuboid (B) has 10 rows and is not prebuilt. The
ground-truth rollup count from Cuboid (A,B) to Cuboid (B) is
{code}
Cuboid (A,B) - Cuboid (B) = 1000 - 10 = 990
{code}
Suppose the values of B are evenly distributed in Cuboid (A,B). Then for each value of B, the
row count in Cuboid (A,B) is 1000 / 10 = 100.
Now consider the SQL query
{code}
select B, count(*)
from T
where B = 'e1'
group by B
{code}
Then the rollup count estimated by the current algorithm is
{code}
Cuboid (A,{'e1'}) - return count = 100 - 1 = 99
{code}
which is much smaller than 990 because most of the scanned rows were removed by the filter.
It is better to calculate the rollup rate first and then multiply it by the parent cuboid row
count to estimate the rollup count. The refined formula is as follows:
{code}
Cuboid (A,B) - Cuboid (A,B) * (return count) / Cuboid (A,{'e1'}) = 1000 - 1000*1/100 = 990
{code}
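As a minimal sketch of the two estimators for the query above (plain Python with hypothetical names, not Kylin's actual code):

```python
def current_rollup_estimate(scanned_rows, returned_rows):
    # Current algorithm: rollup count = rows scanned in the parent cuboid
    # minus rows returned to the query.
    return scanned_rows - returned_rows

def refined_rollup_estimate(parent_rows, scanned_rows, returned_rows):
    # Refined algorithm: compute the rollup rate on the scanned slice
    # (returned/scanned), then scale by the full parent cuboid row count.
    return parent_rows - parent_rows * returned_rows / scanned_rows

parent = 1000   # rows in prebuilt Cuboid (A,B)
scanned = 100   # rows of Cuboid (A,B) matching B = 'e1'
returned = 1    # rows returned by the query

print(current_rollup_estimate(scanned, returned))          # 99
print(refined_rollup_estimate(parent, scanned, returned))  # 990.0
```

The refined estimate recovers the true rollup count of 990, while the current one is distorted by the filter.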
Consider another SQL query
{code}
select count(*)
from T
where B in ('e1','e2')
{code}
The rollup count estimated by the current algorithm is
{code}
Cuboid (A,{'e1','e2'}) - return count = 100*2 - 1 = 199
{code}
The rollup count estimated by the refined algorithm is
{code}
Cuboid (A,B) - Cuboid (A,B) * (return count) / Cuboid (A,{'e1','e2'}) = 1000 - 1000*1/(100*2)
= 995
{code}
In summary, the refined algorithm is much less influenced by the filters in the SQL.
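A small sketch (plain Python, hypothetical names) showing that the refined estimate stays close to the true rollup count of 990 regardless of how many values the filter selects:

```python
def refined_rollup(parent_rows, scanned_rows, returned_rows):
    # Rollup rate on the scanned slice, scaled up to the whole parent cuboid.
    return parent_rows - parent_rows * returned_rows / scanned_rows

# True rollup count from Cuboid (A,B) to Cuboid (B) is 1000 - 10 = 990.
print(refined_rollup(1000, 100, 1))  # 990.0 for "where B = 'e1'"
print(refined_rollup(1000, 200, 1))  # 995.0 for "where B in ('e1','e2')"
```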
> Improve Mandatory Cuboid Recommendation Algorithm
> 
>
> Key: KYLIN-3540
> URL: https://issues.apache.org/jira/browse/KYLIN3540
> Project: Kylin
> Issue Type: Improvement
> Reporter: Zhong Yanghong
> Assignee: Zhong Yanghong
> Priority: Major
>

This message was sent by Atlassian JIRA
(v7.6.3#76005)
