kylin-user mailing list archives

From ShaoFeng Shi <shaofeng...@gmail.com>
Subject Re: New document: "How to optimize cube build"
Date Mon, 06 Feb 2017 13:43:28 GMT
Ajay, thanks for your feedback.
For question 1: the code has been merged into the master branch. The next release will be 2.0,
and a beta release will be published soon.
For question 2: yes, your understanding is correct. A FULL cube with N dimensions has 2^N - 1
cuboids. But if you use optimizations such as hierarchy dimensions, joint dimensions, or splitting
the dimensions into multiple aggregation groups, the result is a "partial" cube, meaning some
cuboids are pruned.
If a query uses dimensions that span aggregation groups, only the base cuboid can serve it.
Kylin then has to post-aggregate from the base cuboid, which degrades performance.
Please check whether this is the case on your side.
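
The cuboid counts in this reply can be illustrated with a small Python sketch. It is a simplified
model, not Kylin's actual cuboid scheduler: it assumes each aggregation group contributes every
non-empty combination of its own dimensions, that the base cuboid is always kept, and it ignores
hierarchy, joint and mandatory rules. The dimension names A to G and all function names are
hypothetical.

```python
from itertools import combinations


def full_cube_cuboids(dims):
    """Every non-empty subset of dims, i.e. 2^N - 1 cuboids for N dimensions."""
    return {
        frozenset(combo)
        for size in range(1, len(dims) + 1)
        for combo in combinations(dims, size)
    }


def partial_cube_cuboids(dims, groups):
    """Simplified model (assumption): union of each group's full sub-cube
    plus the base cuboid that contains all dimensions."""
    cuboids = {frozenset(dims)}  # the base cuboid is always kept
    for group in groups:
        cuboids |= full_cube_cuboids(group)
    return cuboids


def smallest_serving_cuboid(query_dims, cuboids):
    """Smallest cuboid that contains every dimension used by the query."""
    wanted = frozenset(query_dims)
    candidates = [c for c in cuboids if wanted <= c]
    return min(candidates, key=len)


dims = list("ABCDEFG")
full = full_cube_cuboids(dims)
partial = partial_cube_cuboids(dims, [list("ABCD"), list("EFG")])

print(len(full))     # full cube with 7 dimensions
print(len(partial))  # partial cube with two aggregation groups
print(sorted(smallest_serving_cuboid({"A", "B"}, partial)))  # stays inside one group
print(sorted(smallest_serving_cuboid({"A", "E"}, partial)))  # spans both groups
```

In this model the 7-dimension full cube has 127 cuboids, the two-group partial cube has
15 + 7 + 1 = 23, and a query on A and E can only be served by the 7-dimension base cuboid.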

On Mon, Feb 6, 2017 at 2:05 PM +0900, "Ajay Chitre" <chitre.ajay@gmail.com> wrote:

Thanks for writing this document. It's very helpful. I have the following questions:

1) Doc says... "Kylin will build dictionaries in memory (in next version this will be moved
to MR)".

Which version can we expect this in? For large Cubes this process takes a long time on a local
machine. We really need to move this to the Hadoop cluster. In fact, it would be great if we
could have an option to run this under Spark :-)

2) About the "Build N-Dimension Cuboid" step.

Does Kylin build ALL Cuboids? My understanding is:

Total no. of Cuboids = (2 to the power of # of dimensions) - 1

Correct?

So if there are 7 dimensions, there will be 127 Cuboids, right? Does Kylin create ALL of them?

I was under the impression that, after some point, Kylin would just get measures from the Base
Cuboid instead of building all of them. Please explain.

Thanks.



On Sat, Feb 4, 2017 at 2:19 AM, Li Yang <liyang@apache.org> wrote:
Feel free to update the document with different opinions. :-)

On Thu, Jan 26, 2017 at 11:34 AM, ShaoFeng Shi <shaofengshi@apache.org> wrote:
Hi Alberto,
Thanks for your comments! In many cases the data is imported to Hadoop in T+1 mode. Especially
when each day's data is tens of GB, it is reasonable to partition the Hive table by date.
The question is whether it is worth keeping a long history of data in Hive. Usually users keep
only a couple of months of data in Hive. If the number of partitions exceeds Hive's threshold,
they can easily remove the oldest partitions or move them to another table. I think that is
common practice with Hive, and it is good to know that Hive 2.0 will address this.
2017-01-25 17:10 GMT+08:00 Alberto Ramón <a.ramonportoles@gmail.com>:
Be careful about partitioning by "FLIGHTDATE".

From https://github.com/albertoRamon/Kylin/tree/master/KylinPerformance

"Option 1: Use id_date as partition column on Hive table. This have a big
 problem: the Hive metastore is meant for few hundred of partitions not 
thousand (Hive 9452 there is an idea to solve this isn’t in progress)"

Hive 2.0 will include a preview (for testing only) that addresses this.

2017-01-25 9:46 GMT+01:00 ShaoFeng Shi <shaofengshi@apache.org>:
Hello,
A new document has been added covering practices for cube builds. Any suggestions or comments
are welcome. We can update the doc later based on feedback.
Here is the link: https://kylin.apache.org/docs16/howto/howto_optimize_build.html

-- 
Best regards,
Shaofeng Shi 史少锋