kylin-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From ShaoFeng Shi <shaofeng...@apache.org>
Subject Re: Use derived or Joint
Date Mon, 12 Dec 2016 01:49:15 GMT
" define "year" - "IDDate" as a hierarchy", which is the benefit ?
--> The combination "year" + "IDDate" has the same line number as the
combination "IDDate"; so aggregate from the former to the latter will not
aggregate much; Then we can prune the later with the "hierarchy" to reduce
the cube size;

Nowadays, in derived columns, the Host column, is always the PK of table ?
--> Yes

2016-12-10 20:25 GMT+08:00 Alberto Ramón <a.ramonportoles@gmail.com>:

> thanks for you clear explanation !!
>
>
> The only point that I can't understand is
> " define "year" - "IDDate" as a hierarchy", which is the benefit ?
>       [image: Imágenes integradas 1]
> Where:
>
>    -
>
>    IDData is PK of Dim table, Unique & Identity
>    - Year is a Normal Dim --> I will have precalculated by years
>
>
> Nowadays, in derived columns, the Host column, is always the PK of table ?
>
>
>
>
> 2016-12-09 15:25 GMT+01:00 ShaoFeng Shi <shaofengshi@apache.org>:
>
>> Hi Albert, I think you're raising a good question; Many users face such
>> questions when using Kylin in their cases. Let me try to share some my
>> cents.
>>
>> "Derived" or "Joint" ?
>> These are two independent means in Kylin (they're not conflict). Using
>> which depends on how these dimensions being used I think;
>>
>> Take the "IDDate" case you mentioned as an example; If most of you
>> queries are aggregated at the PK/FK level (which is date), and user just
>> want to in passing other fields like its "MonthTxt", "DayWeekTxt", defining
>> them as "derived" will be very good.
>>
>> But if you also want to aggregate also at "MonthTxt" or "DayWeekTxt"
>> level, defining them as "Derived" might not be good; Because Kylin need
>> translates the condition of "MonthTxt" into a set of PK values ("IDDate"),
>> and then query from Cube with these values, because the cube only
>> pre-aggreated at "IDDate"; This will slow down the query; (Ofcause if your
>> dataset is small it still be acceptable)
>>
>> Besides, defining "Year" - "DayWeek_ID" - "Month_ID" as hierarchy is not
>> suggested, because they are not a hierarchy relationship, but in parallel
>> here;  ("March" is not a child of "2016", it appears in every year)
>>
>> "Joint" can be used in two typical cases:
>> 1) combine multiple ultra low cardinality dimensions
>> 2) combine dimensions which has 1:1 (like "Month_ID" "Month_Txt") or
>> close to 1:1 relationship (like "USER_ID", "USER_EMAIL")
>>
>> For case 1, I might design the cube in this way (assume you have the need
>> to group by year, month, dayweek):
>> 1) define all them as normal dimension
>> 2) define "year" - "IDDate" as a hierarchy
>> 3) define "Month_ID" + "Month_Txt" as a joint, because they're 1:1
>> 4) define "DayWeek_ID" + "DayWeek_Txt" as a joint, because they're 1:1
>>
>>
>> For case 2, I have the same suggestion as above.
>>
>> 2016-12-09 7:10 GMT+08:00 Alberto Ramón <a.ramonportoles@gmail.com>:
>>
>>> Typical case 1:
>>>
>>> *IDDate*
>>>
>>> *Month_ID*
>>>
>>> *Month_Txt*
>>>
>>> *DayWeek_ID*
>>>
>>> *DayWeek_Txt*
>>>
>>> *Year*
>>>
>>> 2016-03-01
>>>
>>> 3
>>>
>>> March
>>>
>>> 2
>>>
>>> Wendesday
>>>
>>> 2016
>>>
>>> 2016-03-02
>>>
>>> 3
>>>
>>> March
>>>
>>> 3
>>>
>>> Thursday
>>>
>>> 2016
>>>
>>> 2016-03-02
>>>
>>> 3
>>>
>>> March
>>>
>>> 4
>>>
>>> Friday
>>>
>>> 2016
>>>
>>> IDDate is PK of Dim table and Unique
>>>
>>>
>>> SOL 1: Uses Hierarchy and Derived from non PK column
>>>
>>>
>>> *Month_ID*
>>>
>>> Hierarchy 2
>>>
>>> Normal 1
>>>
>>> *Month_Txt*
>>>
>>>
>>> Derived 1
>>>
>>> *DayWeek_ID*
>>>
>>> Hierarchy 3
>>>
>>> Normal 2
>>>
>>> *DayWeek_Txt*
>>>
>>>
>>> Derived 2
>>>
>>> *Year*
>>>
>>> Hierarchy 1
>>>
>>> Normal 3
>>>
>>> Year > Month > Day
>>>
>>> Text are derived from ID (in month and Week)
>>>
>>> PB1: KYLIN-444 <https://issues.apache.org/jira/browse/KYLIN-444>
>>>
>>> PB2: I don't know how create Derived column from non PK with actual UI (Kylin
>>> – 1313 <https://issues.apache.org/jira/browse/KYLIN-1313> v1.5.2 Kylin
>>> 1786 <https://issues.apache.org/jira/browse/KYLIN-1786>, v1.5.3)
>>>
>>>
>>>
>>> SOL 2:
>>>
>>> *Month_ID*
>>>
>>> Hierarchy 2
>>>
>>> Join 1
>>>
>>> *Month_Txt*
>>>
>>>
>>> Join 1
>>>
>>> *DayWeek_ID*
>>>
>>> Hierarchy 3
>>>
>>> Join 2
>>>
>>> *DayWeek_Txt*
>>>
>>>
>>> Join 2
>>>
>>> *Year*
>>>
>>> Hierarchy 1
>>>
>>> Normal 3
>>>
>>>
>>> SOL 2 is this the best solution ??
>>>
>>>
>>>
>>> Typical case 2:
>>>
>>> I see the same scenario a lot of times (derived columns with 1:1
>>> Relation)
>>>
>>> Product_ID *(PK)*
>>>
>>> Product_TXT
>>>
>>> TypeProduct_ID
>>>
>>> TypeProduct_TXT
>>>
>>> Country_TXT
>>>
>>> Country_ID
>>>
>>> Optimize queries by product / category / country, are mandatory
>>>
>>> Perhaps,
>>>
>>> Country (lower cardinality) its a good candidate to Join
>>>
>>> I don't want put Product_TXT as Join, because is a longgggg text, and
>>> can be affect Row_Key of HBase, but I need Queries like ... where
>>> product_TXT = ""iRobot Roomba 650 Robotic Vacuum Cleaner
>>>
>>> suggestions ?
>>>
>>
>>
>>
>> --
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>>
>>
>


-- 
Best regards,

Shaofeng Shi 史少锋

Mime
View raw message