kylin-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From ShaoFeng Shi <shaofeng...@apache.org>
Subject Re: question about how kylin chooses a specific cube design over others
Date Thu, 10 Jan 2019 14:26:14 GMT
Hi Kang-sen,

I understand your problem. You're correct that today the cost calculation
for TopN is quite simple, which makes the cube has TopN always be selected
over the other cubes which doesn't have TopN (if they all can answer the
query). However, in some situations, the user may expect not to use TopN.

The code is here, which just simply 30% for the estimated cost:

https://github.com/apache/kylin/blob/master/core-metadata/src/main/java/org/apache/kylin/measure/topn/TopNMeasureType.java#L284

You can make some change there to see whether it can bypass the problem in
your case.

The root cause of this problem is, when Kylin decides whether to use TopN
or not, it didn't check whether there is "limit" and "order by" in the SQL
context. This was a known issue, because, at that time, the SQL parser
didn't push down such info to Kylin (I remembered).

One of the related JIRA is:
https://issues.apache.org/jira/browse/KYLIN-2620

Please comment on it. If more people think it is a problem, we will
investigate it.

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Work email: shaofeng.shi@kyligence.io
Kyligence Inc: https://kyligence.io/

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org




Kang-Sen Lu <klu@anovadata.com> 于2019年1月10日周四 上午9:34写道:

> Hi, Na:
>
> Thanks for your reply. Let me explain why I created two cube for the same
> fact table.
>
> Normally the query is for accessing aggregated data such as
> SUM(<data-column>). But sometimes we need to query some days better defined
> with TOPN function, especially when th cardinality of the group by
> dimension in the top n is hugh. In that case, using normal cube would be
> asking kyli n to store high  fibroid table.
>
> So I was hoping the kylin cube candidate algorithm can based on the query
> and pick the right cube candidate.
>
> I am trying to study how kylin assigns cost and decided multiplier to each
> cube candidate. Since I don't much about kylin internal terminology, it is
> not obvious to me what the java code is trying to accomplish.
>
> Ideally, if in the query we can find the ingredient of applying top n
> function, then we should assign lower cost to top n cube,otherwise, we
> should avoid using top n for non-stop query. So I thought, the basic
> ingredient to apply top n cube is (1) select list contain s something like
> sum(x); (2) group by list include y; (3) there's should be order by sum(x);
> (4) there should be limit n. That should warrant the apply of topn(x),
> group by y metric.
>
> Thanks for any help to understand the cost evaluation of cube candidate.
>
> Kang-sen
> ------------------------------
> *From:* Na Zhai <na.zhai@kyligence.io>
> *Sent:* Wednesday, January 9, 2019 3:12:07 AM
> *To:* user@kylin.apache.org
> *Subject:* 答复: question about how kylin chooses a specific cube design
> over others
>
>
> Hi, Kang-Sen Lu.
>
>
>
> If multiple cubes contain the same table, Kylin will do the following
> thing.
>
>
>
> 1. Kylin will choose the cube that contains all the dimensions in your
> SQL.
>
> 2. if there are still more than one cubes, Kylin will choose the cube that
> has less cost(it depends on the dimension, measure and inner join number).
>
>
>
> 发送自 Windows 10 版邮件 <https://go.microsoft.com/fwlink/?LinkId=550986>应用
>
>
> ------------------------------
> *发件人:* Kang-Sen Lu <klu@anovadata.com>
> *发送时间:* Monday, January 7, 2019 11:11:09 PM
> *收件人:* user@kylin.apache.org
> *主题:* RE: question about how kylin chooses a specific cube design over
> others
>
>
> Let me clarify my topn query problem clearly. Here is the query:
>
>
>
> SELECT  (COUNT(DISTINCT ZETTICSDW.A_MA_HOURLY_V.SUBSCRIBER_ID)),
> (SUM(ZETTICSDW.A_MA_HOURLY_V.HITS)),
> (SUM(ZETTICSDW.A_MA_HOURLY_V.PAGE_VIEWS)),
> (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSIONS)),
> (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSION_TIME)),
> (SUM(ZETTICSDW.A_MA_HOURLY_V.DOWN_BYTES)),
> (SUM(ZETTICSDW.A_MA_HOURLY_V.DATA_CONSUMED)),
> (SUM(ZETTICSDW.A_MA_HOURLY_V.UP_BYTES))  FROM  ZETTICSDW.A_MA_HOURLY_V
> WHERE  ((ZETTICSDW.A_MA_HOURLY_V.THEDATE >= '20180501') AND
> (ZETTICSDW.A_MA_HOURLY_V.THEDATE <= '20180501'))
>
>
>
> I have a data model, ma_aggs_model.
>
>
>
> I have created two cube designs: ma_aggs_cube, ma_aggs_topn_cube.
>
>
>
> In ma_aggs_cube, I have configured metric like
> “COUNT_DISTINCT(SUBSCRIBER_ID), SUM(HITS),…
>
>
>
> In ma_aggs_topn_cube, I have configured metric like SUM(HITS), TOPN(HITS,
> GROUP BY SUBSCRIBER_ID), … But no COUNT_DISTINCT(SUBSCRIBER_ID) metric.
>
>
>
> When a query contains COUNT(DISTINCT(SUBSCRIBER_ID) is issued, kylin
> favored to use ma_aggs_topn_cube over ma_aggs_cube. If I disabled
> ma_aggs_topc_cube, then kylin chooses ma_aggs_cube.
>
>
>
> The question I have is why COUNT_DISTINCT(xyz) in select phrase would
> always favor the topn cube, even if there is no “LIMIT xyz” and no “ORDER
> BY SUBSCRIBER_ID INSC” in the query. What is the criterion in kylin which
> favors topn cube over non-topn cube?
>
>
>
> Thanks.
>
>
>
> Kang-sen
>
>
>
> *From:* Kang-Sen Lu <klu@anovadata.com>
> *Sent:* Monday, January 07, 2019 8:21 AM
> *To:* user@kylin.apache.org
> *Subject:* RE: question about how kylin chooses a specific cube design
> over others
>
>
>
> I want to add another data point: if I disable the topn cube design, then
> the same query would go through smoothly and correctly. Her eis the kylin
> log:
>
>
>
> 2019-01-07 08:18:16,256 INFO  [Query
> a78321d0-5913-758b-6eb1-db5a39f92735-134702] service.QueryService:387 : The
> original query:   SELECT  (COUNT(DISTINCT
> ZETTICSDW.A_MA_HOURLY_V.SUBSCRIBER_ID)),
> (SUM(ZETTICSDW.A_MA_HOURLY_V.HITS)),
> (SUM(ZETTICSDW.A_MA_HOURLY_V.PAGE_VIEWS)),
> (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSIONS)),
> (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSION_TIME)),
> (SUM(ZETTICSDW.A_MA_HOURLY_V.DOWN_BYTES)),
> (SUM(ZETTICSDW.A_MA_HOURLY_V.DATA_CONSUMED)),
> (SUM(ZETTICSDW.A_MA_HOURLY_V.UP_BYTES))  FROM  ZETTICSDW.A_MA_HOURLY_V
> WHERE  ((ZETTICSDW.A_MA_HOURLY_V.THEDATE >= '20180501') AND
> (ZETTICSDW.A_MA_HOURLY_V.THEDATE <= '20180501'))
>
>
>
> 2019-01-07 08:18:16,263 INFO  [Query
> a78321d0-5913-758b-6eb1-db5a39f92735-134702] service.QueryService:563 : The
> corrected query: SELECT  (COUNT(DISTINCT
> ZETTICSDW.A_MA_HOURLY_V.SUBSCRIBER_ID)),
> (SUM(ZETTICSDW.A_MA_HOURLY_V.HITS)),
> (SUM(ZETTICSDW.A_MA_HOURLY_V.PAGE_VIEWS)),
> (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSIONS)),
> (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSION_TIME)),
> (SUM(ZETTICSDW.A_MA_HOURLY_V.DOWN_BYTES)),
> (SUM(ZETTICSDW.A_MA_HOURLY_V.DATA_CONSUMED)),
> (SUM(ZETTICSDW.A_MA_HOURLY_V.UP_BYTES))  FROM  ZETTICSDW.A_MA_HOURLY_V
> WHERE  ((ZETTICSDW.A_MA_HOURLY_V.THEDATE >= '20180501') AND
> (ZETTICSDW.A_MA_HOURLY_V.THEDATE <= '20180501'))
>
> LIMIT 50000
>
> 2019-01-07 08:18:16,284 INFO  [Query
> a78321d0-5913-758b-6eb1-db5a39f92735-134702] routing.QueryRouter:58 : Find
> candidates by table ZETTICSDW.A_MA_HOURLY_V and project=Anovadata :
> CUBE[name=ma_aggs_cube_6]
>
> 2019-01-07 08:18:16,285 INFO  [Query
> a78321d0-5913-758b-6eb1-db5a39f92735-134702] routing.QueryRouter:51 :
> Applying rule: class
> org.apache.kylin.query.routing.rules.RemoveBlackoutRealizationsRule,
> realizations before: [CUBE[name=ma_aggs_cube_6]], realizations after:
> [CUBE[name=ma_aggs_cube_6]]
>
> 2019-01-07 08:18:16,285 INFO  [Query
> a78321d0-5913-758b-6eb1-db5a39f92735-134702] routing.QueryRouter:51 :
> Applying rule: class
> org.apache.kylin.query.routing.rules.RemoveUncapableRealizationsRule,
> realizations before: [CUBE[name=ma_aggs_cube_6]], realizations after:
> [CUBE[name=ma_aggs_cube_6]]
>
> 2019-01-07 08:18:16,286 INFO  [Query
> a78321d0-5913-758b-6eb1-db5a39f92735-134702] rules.RealizationSortRule:40 :
> CUBE[name=ma_aggs_cube_6] priority 1 cost 279.
>
>
>
> Kang-sen
>
>
>
> *From:* Kang-Sen Lu <klu@anovadata.com>
> *Sent:* Monday, January 07, 2019 8:06 AM
> *To:* user@kylin.apache.org
> *Subject:* question about how kylin chooses a specific cube design over
> others
>
>
>
> I am using kylin 2.5.1. I have a data model and two cube designs on top of
> that data model.
>
>
>
> One data model is used to perform aggregation over a set of aggregation
> groups. So the metric are all using “COUNT DISTINCT”, and “SUM” functions.
> To speed up TOPN application, I have created another smaller cube design
> which addresses the TOPN application specifically, i.e. the metric contains
> only “COUNT DISTINCT”, and “TOPN”, but no “SUM” aggregation.
>
>
>
> Here is a normal query, and somehow I am surprised that the TOPN cube was
> sleected. That caused over 10,000,000 row of data being returned and failed.
>
>
>
> From the kylin.log, you can see the “SELECT” statement and the cost
> evaluation data. I am not sure what has caused the wrongly selection of the
> cube. I hope someone can provide me with some hint or references.
>
>
>
>
>
> 2019-01-07 07:55:46,137 INFO  [Query
> 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] service.QueryService:387 : The
> original query:  SELECT  (COUNT(DISTINCT
> ZETTICSDW.A_MA_HOURLY_V.SUBSCRIBER_ID)),
> (SUM(ZETTICSDW.A_MA_HOURLY_V.HITS)),
> (SUM(ZETTICSDW.A_MA_HOURLY_V.PAGE_VIEWS)),
> (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSIONS)),
> (SUM(ZETTICSDW.A_MA_HOURLY_V.SESSION_TIME)),
> (SUM(ZETTICSDW.A_MA_HOURLY_V.DOWN_BYTES)),
> (SUM(ZETTICSDW.A_MA_HOURLY_V.DATA_CONSUMED)),
> (SUM(ZETTICSDW.A_MA_HOURLY_V.UP_BYTES))  FROM  ZETTICSDW.A_MA_HOURLY_V
> WHERE  ((ZETTICSDW.A_MA_HOURLY_V.THEDATE >= '20180501') AND
> (ZETTICSDW.A_MA_HOURLY_V.THEDATE <= '20180501'))
>
>
>
> 2019-01-07 07:55:46,184 INFO  [Query
> 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:58 : Find
> candidates by table ZETTICSDW.A_MA_HOURLY_V and project=Anovadata :
> CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube]
>
> 2019-01-07 07:55:46,185 INFO  [Query
> 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:51 :
> Applying rule: class
> org.apache.kylin.query.routing.rules.RemoveBlackoutRealizationsRule,
> realizations before:
> [CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube]], realizations
> after: [CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube]]
>
> 2019-01-07 07:55:46,185 INFO  [Query
> 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:51 :
> Applying rule: class
> org.apache.kylin.query.routing.rules.RemoveUncapableRealizationsRule,
> realizations before:
> [CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube]], realizations
> after: [CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube]]
>
> 2019-01-07 07:55:46,185 INFO  [Query
> 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] rules.RealizationSortRule:40 :
> CUBE[name=ma_aggs_cube_6] priority 1 cost 279. CUBE[name=ma_aggs_topn_cube]
> priority 1 cost 105.
>
> 2019-01-07 07:55:46,186 INFO  [Query
> 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:51 :
> Applying rule: class
> org.apache.kylin.query.routing.rules.RealizationSortRule, realizations
> before: [CUBE[name=ma_aggs_cube_6],CUBE[name=ma_aggs_topn_cube]],
> realizations after: [CUBE[name=ma_aggs_topn_cube],CUBE[name=ma_aggs_cube_6]]
>
> 2019-01-07 07:55:46,186 INFO  [Query
> 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:95 :
> Adjust DimensionAsMeasure for FunctionDesc [expression=COUNT_DISTINCT,
> parameter=ZETTICSDW.A_MA_HOURLY_V.SUBSCRIBER_ID, returnType=null]
>
> 2019-01-07 07:55:46,186 INFO  [Query
> 06ddbf4a-8cef-1515-463e-e1067eaaae3a-134709] routing.QueryRouter:75 : The
> realizations remaining:
> [CUBE[name=ma_aggs_topn_cube],CUBE[name=ma_aggs_cube_6]],and the final
> chosen one for current olap context 0 is CUBE[name=ma_aggs_topn_cube]
>
>
>
> Thanks.
>
>
>
> Kang-sen
>

Mime
View raw message