kylin-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From ShaoFeng Shi <shaofeng...@apache.org>
Subject Re: Use derived or Joint
Date Fri, 09 Dec 2016 14:25:45 GMT
Hi Albert, I think you're raising a good question; Many users face such
questions when using Kylin in their cases. Let me try to share some my
cents.

"Derived" or "Joint" ?
These are two independent means in Kylin (they're not conflict). Using
which depends on how these dimensions being used I think;

Take the "IDDate" case you mentioned as an example; If most of you queries
are aggregated at the PK/FK level (which is date), and user just want to in
passing other fields like its "MonthTxt", "DayWeekTxt", defining them as
"derived" will be very good.

But if you also want to aggregate also at "MonthTxt" or "DayWeekTxt" level,
defining them as "Derived" might not be good; Because Kylin need translates
the condition of "MonthTxt" into a set of PK values ("IDDate"), and then
query from Cube with these values, because the cube only pre-aggreated at
"IDDate"; This will slow down the query; (Ofcause if your dataset is small
it still be acceptable)

Besides, defining "Year" - "DayWeek_ID" - "Month_ID" as hierarchy is not
suggested, because they are not a hierarchy relationship, but in parallel
here;  ("March" is not a child of "2016", it appears in every year)

"Joint" can be used in two typical cases:
1) combine multiple ultra low cardinality dimensions
2) combine dimensions which has 1:1 (like "Month_ID" "Month_Txt") or close
to 1:1 relationship (like "USER_ID", "USER_EMAIL")

For case 1, I might design the cube in this way (assume you have the need
to group by year, month, dayweek):
1) define all them as normal dimension
2) define "year" - "IDDate" as a hierarchy
3) define "Month_ID" + "Month_Txt" as a joint, because they're 1:1
4) define "DayWeek_ID" + "DayWeek_Txt" as a joint, because they're 1:1


For case 2, I have the same suggestion as above.

2016-12-09 7:10 GMT+08:00 Alberto Ramón <a.ramonportoles@gmail.com>:

> Typical case 1:
>
> *IDDate*
>
> *Month_ID*
>
> *Month_Txt*
>
> *DayWeek_ID*
>
> *DayWeek_Txt*
>
> *Year*
>
> 2016-03-01
>
> 3
>
> March
>
> 2
>
> Wendesday
>
> 2016
>
> 2016-03-02
>
> 3
>
> March
>
> 3
>
> Thursday
>
> 2016
>
> 2016-03-02
>
> 3
>
> March
>
> 4
>
> Friday
>
> 2016
>
> IDDate is PK of Dim table and Unique
>
>
> SOL 1: Uses Hierarchy and Derived from non PK column
>
>
> *Month_ID*
>
> Hierarchy 2
>
> Normal 1
>
> *Month_Txt*
>
>
> Derived 1
>
> *DayWeek_ID*
>
> Hierarchy 3
>
> Normal 2
>
> *DayWeek_Txt*
>
>
> Derived 2
>
> *Year*
>
> Hierarchy 1
>
> Normal 3
>
> Year > Month > Day
>
> Text are derived from ID (in month and Week)
>
> PB1: KYLIN-444 <https://issues.apache.org/jira/browse/KYLIN-444>
>
> PB2: I don't know how create Derived column from non PK with actual UI (Kylin
> – 1313 <https://issues.apache.org/jira/browse/KYLIN-1313> v1.5.2 Kylin
> 1786 <https://issues.apache.org/jira/browse/KYLIN-1786>, v1.5.3)
>
>
>
> SOL 2:
>
> *Month_ID*
>
> Hierarchy 2
>
> Join 1
>
> *Month_Txt*
>
>
> Join 1
>
> *DayWeek_ID*
>
> Hierarchy 3
>
> Join 2
>
> *DayWeek_Txt*
>
>
> Join 2
>
> *Year*
>
> Hierarchy 1
>
> Normal 3
>
>
> SOL 2 is this the best solution ??
>
>
>
> Typical case 2:
>
> I see the same scenario a lot of times (derived columns with 1:1 Relation)
>
> Product_ID *(PK)*
>
> Product_TXT
>
> TypeProduct_ID
>
> TypeProduct_TXT
>
> Country_TXT
>
> Country_ID
>
> Optimize queries by product / category / country, are mandatory
>
> Perhaps,
>
> Country (lower cardinality) its a good candidate to Join
>
> I don't want put Product_TXT as Join, because is a longgggg text, and can
> be affect Row_Key of HBase, but I need Queries like ... where product_TXT =
> ""iRobot Roomba 650 Robotic Vacuum Cleaner
>
> suggestions ?
>



-- 
Best regards,

Shaofeng Shi 史少锋

Mime
View raw message