kylin-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Na Zhai <na.z...@kyligence.io>
Subject 答复: question about how kylin chooses a specific cube design over others
Date Wed, 09 Jan 2019 08:12:07 GMT
Hi, Kang-Sen Lu.

If multiple cubes contain the same table, Kylin will do the following thing.

1. Kylin will choose the cube that contains all the dimensions in your SQL.
2. if there are still more than one cubes, Kylin will choose the cube that has less cost(it
depends on the dimension, measure and inner join number).

发送自 Windows 10 版邮件<https://go.microsoft.com/fwlink/?LinkId=550986>应用

________________________________
发件人: Kang-Sen Lu <klu@anovadata.com>
发送时间: Monday, January 7, 2019 11:11:09 PM
收件人: user@kylin.apache.org
主题: RE: question about how kylin chooses a specific cube design over others

Let me clarify my topn query problem clearly. Here is the query:

SELECT  (COUNT(DISTINCT ZETTICSDW.A_MA_HOURLY_V.SUBSCRIBER_ID)), (SUM(ZETTICSDW.A_MA_HOURLY_V.HITS)),
(SUM(ZETTICSDW.A_MA_HOURLY_V.PAGE_VIEWS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSIONS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSION_TIME)),
(SUM(ZETTICSDW.A_MA_HOURLY_V.DOWN_BYTES)), (SUM(ZETTICSDW.A_MA_HOURLY_V.DATA_CONSUMED)), (SUM(ZETTICSDW.A_MA_HOURLY_V.UP_BYTES))
 FROM  ZETTICSDW.A_MA_HOURLY_V   WHERE  ((ZETTICSDW.A_MA_HOURLY_V.THEDATE >= '20180501')
AND (ZETTICSDW.A_MA_HOURLY_V.THEDATE <= '20180501'))

I have a data model, ma_aggs_model.

I have created two cube designs: ma_aggs_cube, ma_aggs_topn_cube.

In ma_aggs_cube, I have configured metric like “COUNT_DISTINCT(SUBSCRIBER_ID), SUM(HITS),…

In ma_aggs_topn_cube, I have configured metric like SUM(HITS), TOPN(HITS, GROUP BY SUBSCRIBER_ID),
… But no COUNT_DISTINCT(SUBSCRIBER_ID) metric.

When a query contains COUNT(DISTINCT(SUBSCRIBER_ID) is issued, kylin favored to use ma_aggs_topn_cube
over ma_aggs_cube. If I disabled ma_aggs_topc_cube, then kylin chooses ma_aggs_cube.

The question I have is why COUNT_DISTINCT(xyz) in select phrase would always favor the topn
cube, even if there is no “LIMIT xyz” and no “ORDER BY SUBSCRIBER_ID INSC” in the
query. What is the criterion in kylin which favors topn cube over non-topn cube?

Thanks.

Kang-sen

From: Kang-Sen Lu <klu@anovadata.com>
Sent: Monday, January 07, 2019 8:21 AM
To: user@kylin.apache.org
Subject: RE: question about how kylin chooses a specific cube design over others

I want to add another data point: if I disable the topn cube design, then the same query would
go through smoothly and correctly. Her eis the kylin log:

2019-01-07 08:18:16,256 INFO  [Query a78321d0-5913-758b-6eb1-db5a39f92735-134702] service.QueryService:387
: The original query:   SELECT  (COUNT(DISTINCT ZETTICSDW.A_MA_HOURLY_V.SUBSCRIBER_ID)), (SUM(ZETTICSDW.A_MA_HOURLY_V.HITS)),
(SUM(ZETTICSDW.A_MA_HOURLY_V.PAGE_VIEWS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSIONS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSION_TIME)),
(SUM(ZETTICSDW.A_MA_HOURLY_V.DOWN_BYTES)), (SUM(ZETTICSDW.A_MA_HOURLY_V.DATA_CONSUMED)), (SUM(ZETTICSDW.A_MA_HOURLY_V.UP_BYTES))
 FROM  ZETTICSDW.A_MA_HOURLY_V   WHERE  ((ZETTICSDW.A_MA_HOURLY_V.THEDATE >= '20180501')
AND (ZETTICSDW.A_MA_HOURLY_V.THEDATE <= '20180501'))

2019-01-07 08:18:16,263 INFO  [Query a78321d0-5913-758b-6eb1-db5a39f92735-134702] service.QueryService:563
: The corrected query: SELECT  (COUNT(DISTINCT ZETTICSDW.A_MA_HOURLY_V.SUBSCRIBER_ID)), (SUM(ZETTICSDW.A_MA_HOURLY_V.HITS)),
(SUM(ZETTICSDW.A_MA_HOURLY_V.PAGE_VIEWS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSIONS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSION_TIME)),
(SUM(ZETTICSDW.A_MA_HOURLY_V.DOWN_BYTES)), (SUM(ZETTICSDW.A_MA_HOURLY_V.DATA_CONSUMED)), (SUM(ZETTICSDW.A_MA_HOURLY_V.UP_BYTES))
 FROM  ZETTICSDW.A_MA_HOURLY_V   WHERE  ((ZETTICSDW.A_MA_HOURLY_V.THEDATE >= '20180501')
AND (ZETTICSDW.A_MA_HOURLY_V.THEDATE <= '20180501'))
LIMIT 50000
2019-01-07 08:18:16,284 INFO  [Query a78321d0-5913-758b-6eb1-db5a39f92735-134702] routing.QueryRouter:58
: Find candidates by table ZETTICSDW.A_MA_HOURLY_V and project=Anovadata : CUBE[name=ma_aggs_cube_6]
2019-01-07 08:18:16,285 INFO  [Query a78321d0-5913-758b-6eb1-db5a39f92735-134702] routing.QueryRouter:51
: Applying rule: class org.apache.kylin.query.routing.rules.RemoveBlackoutRealizationsRule,
realizations before: [CUBE[name=ma_aggs_cube_6]], realizations after: [CUBE[name=ma_aggs_cube_6]]
2019-01-07 08:18:16,285 INFO  [Query a78321d0-5913-758b-6eb1-db5a39f92735-134702] routing.QueryRouter:51
: Applying rule: class org.apache.kylin.query.routing.rules.RemoveUncapableRealizationsRule,
realizations before: [CUBE[name=ma_aggs_cube_6]], realizations after: [CUBE[name=ma_aggs_cube_6]]
2019-01-07 08:18:16,286 INFO  [Query a78321d0-5913-758b-6eb1-db5a39f92735-134702] rules.RealizationSortRule:40
: CUBE[name=ma_aggs_cube_6] priority 1 cost 279.

Kang-sen

From: Kang-Sen Lu <klu@anovadata.com<mailto:klu@anovadata.com>>
Sent: Monday, January 07, 2019 8:06 AM
To: user@kylin.apache.org<mailto:user@kylin.apache.org>
Subject: question about how kylin chooses a specific cube design over others

I am using kylin 2.5.1. I have a data model and two cube designs on top of that data model.

One data model is used to perform aggregation over a set of aggregation groups. So the metric
are all using “COUNT DISTINCT”, and “SUM” functions. To speed up TOPN application,
I have created another smaller cube design which addresses the TOPN application specifically,
i.e. the metric contains only “COUNT DISTINCT”, and “TOPN”, but no “SUM” aggregation.

Here is a normal query, and somehow I am surprised that the TOPN cube was sleected. That caused
over 10,000,000 row of data being returned and failed.

From the kylin.log, you can see the “SELECT” statement and the cost evaluation data. I
am not sure what has caused the wrongly selection of the cube. I hope someone can provide
me with some hint or references.


2019-01-07 07:55:46,137 INFO  [Query 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] service.QueryService:387
: The original query:  SELECT  (COUNT(DISTINCT ZETTICSDW.A_MA_HOURLY_V.SUBSCRIBER_ID)), (SUM(ZETTICSDW.A_MA_HOURLY_V.HITS)),
(SUM(ZETTICSDW.A_MA_HOURLY_V.PAGE_VIEWS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSIONS)), (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSION_TIME)),
(SUM(ZETTICSDW.A_MA_HOURLY_V.DOWN_BYTES)), (SUM(ZETTICSDW.A_MA_HOURLY_V.DATA_CONSUMED)), (SUM(ZETTICSDW.A_MA_HOURLY_V.UP_BYTES))
 FROM  ZETTICSDW.A_MA_HOURLY_V   WHERE  ((ZETTICSDW.A_MA_HOURLY_V.THEDATE >= '20180501')
AND (ZETTICSDW.A_MA_HOURLY_V.THEDATE <= '20180501'))

2019-01-07 07:55:46,184 INFO  [Query 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:58
: Find candidates by table ZETTICSDW.A_MA_HOURLY_V and project=Anovadata : CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube]
2019-01-07 07:55:46,185 INFO  [Query 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:51
: Applying rule: class org.apache.kylin.query.routing.rules.RemoveBlackoutRealizationsRule,
realizations before: [CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube]], realizations
after: [CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube]]
2019-01-07 07:55:46,185 INFO  [Query 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:51
: Applying rule: class org.apache.kylin.query.routing.rules.RemoveUncapableRealizationsRule,
realizations before: [CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube]], realizations
after: [CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube]]
2019-01-07 07:55:46,185 INFO  [Query 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] rules.RealizationSortRule:40
: CUBE[name=ma_aggs_cube_6] priority 1 cost 279. CUBE[name=ma_aggs_topn_cube] priority 1 cost
105.
2019-01-07 07:55:46,186 INFO  [Query 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:51
: Applying rule: class org.apache.kylin.query.routing.rules.RealizationSortRule, realizations
before: [CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube]], realizations after: [CUBE[name=ma_aggs_topn_cube],CUBE[name=ma_aggs_cube_6]]
2019-01-07 07:55:46,186 INFO  [Query 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:95
: Adjust DimensionAsMeasure for FunctionDesc [expression=COUNT_DISTINCT, parameter=ZETTICSDW.A_MA_HOURLY_V.SUBSCRIBER_ID,
returnType=null]
2019-01-07 07:55:46,186 INFO  [Query 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:75
: The realizations remaining: [CUBE[name=ma_aggs_topn_cube],CUBE[name=ma_aggs_cube_6]],and
the final chosen one for current olap context 0 is CUBE[name=ma_aggs_topn_cube]

Thanks.

Kang-sen
Mime
View raw message