kylin-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Li Yang <liy...@apache.org>
Subject Re: Understanding aggregation group settings
Date Mon, 17 Jul 2017 20:14:25 GMT
I think the real question to answer is -- how would user query?

Once you can describe user's query pattern, it becomes smooth to discuss
how to optimize the cube using aggregation group, hierarchy, joint etc..

On Tue, Jul 11, 2017 at 10:58 PM, ShaoFeng Shi <shaofengshi@apache.org>
wrote:

> hi Stefan,
>
> You question is too long; today I can just answer your first question:
>
> Q:  "We have understood the concept behind derived dimensions and
> hierarchies, but we cannot figure out how to implement it with the given
> UI."
>
> Answer:
> 1) The "derived dimension" can only be from lookup tables, because kylin
> will take snapshot for lookup table, so with the snapshot in memory, kylin
> will be able to do a "deriving" operation from PK/FK to the other columns
> on lookup tables. Okay this is the background, now when you add a dimension
> from a lookup table, you will see there is an option letting you select
> "derived" or "normal". If you select "derived", it will put the FK column
> as Cube's physical dimension and remember using it to map this dimension at
> runtime. If you select "normal", then this column will be a physical
> dimension.
>
> 2) The hierarchy is a relationship among several dimensions. You need
> define the relationship in the "advanced settings" step; In a aggregation
> group, putting them into a "hierarchy" and in the sequence from parent to
> children. For example: country, city, street.
>
> You can check the sample cube that Kylin ships as an example.
>
>
> 2017-07-07 16:59 GMT+08:00 Hüls, Stefan <stefan.huels@tui.com>:
>
>> Hi there,
>>
>>
>>
>> we are currently investigating Kylin to help us bring our OLAP
>> environment to hadoop.
>>
>>
>>
>> We have a very traditional datamart approach modelled in our DWH, but
>> struggle to understand how to model cubes in Kylin.
>>
>> We have understood the concept behind derived dimensions and hierarchies,
>> but we cannot figure out how to implement it with the given UI.
>>
>> I hope you can give us some hints or insights on how to do an optimal
>> cube build.
>>
>>
>>
>> We have one fact table (F) which has measures and one ID attribute per
>> dimension table. The IDs are the FK for the dimension tables.
>>
>> Like in the "Optimize Cube Design" article, this is a typical example:
>>
>>
>>
>> Fact                                           Dimensions
>>
>> measure1, measure2, ..., measureX, FK1, FK2, FK3 <-inner join-> PK1
>> H1                      (D_TIME_DIMENSION)
>>
>>                                                  <-inner join-> PK2 H1,
>> H2, H3, H4, H5, H6  (D_GEOGRAPHY)
>>
>>                                                  <-inner join-> PK2 H1,
>> H2, H3, H4          (D_DATE)
>>
>> where PK has a 1 to 1 relationship to H1 in every dimension.
>>
>>
>>
>> 1.) How to model the dimension for D_TIME_DIMENSION. Is it PK as normal
>> and H1 as derived dimension, or do we ignore PK1 and just model H1 as
>> normal dimension?
>>
>> 2.) How to model the dimension for D_GEOGRAPHY. Are H1 to H6 normal
>> dimensions and should be modelled as a hierarchy later, or is any dimension
>> derived?
>>
>> 3.) The same question arises for D_DATE.
>>
>>
>>
>> The next questions arise regarding the aggregation groups. Since in your
>> description, every table attribute that is used in the cube model is called
>> "dimension", the
>>
>> question comes up which of these dimensions from the above example should
>> be included in aggregation groups and how.
>>
>>
>>
>> I would understand that if we do not define an aggregation group, we
>> would have 2^11 cuboids to be created (if every Hx dimension is "normal").
>>
>> Assume that for lookup table D_GEOGRAPHY, the H1 dimension has a high
>> cardinality and H2 to H6 are subgroups of H1 (aka. levels in a geo tree)
>>
>> H1 of D_TIME_DIMENSION is mandatory in all analyses.
>>
>> H1 of D_DATE is a classic date field, with H2 to H4 being WEEK, MONTH and
>> YEAR.
>>
>> How is the aggregation group UI meant to be used in this example? I
>> understand that aggregation groups are a white-list to reduce the number of
>> cuboids?
>>
>>
>>
>> In our current IBM Cognos Transformer Powercubes environment, we have
>> models with about 30 dimension tables and a total of about 90 levels. The
>> fact table has about 60 million rows.
>>
>> The final cube has a size of about 10 GB.
>>
>> This would result in about 2^30 to 2^90 cuboids in Kylin and is
>> impossible to produce. Do you know what kind of algorithm is used to
>> produce the IBM Powercubes?
>>
>> Access time is not very good on a lower aggregation layer, but the top
>> level aggregations are fast. The software is about 15 years old and his is
>> own issues, but
>>
>> building cubes with massive dimension numbers works fine as long as there
>> are not too many rows in the fact table and the filesize of a single file
>>
>> does not exceed 2GB at any time.
>>
>>
>>
>> We would like to somehow model these kind of aggregations, but we think
>> we would have to think different to achieve this.
>>
>> We currently need about 4 hours on a 2 core machine with 8GB RAM to
>> create these Cognos cubes. How is this even possible and what does Kylin do
>> different.
>>
>>
>>
>> We are currently using Kylin 2.0 on a Cloudera 5.11 cluster.
>>
>>
>>
>> Any help would be much appreciated.
>>
>> Best regards,
>>
>> Stefan
>>
>>
>>
>
>
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
>
>

Mime
View raw message