kylin-user mailing list archives

From BELLIER Jean-luc <>
Subject RE: usage of the Kylin Web interface and performance
Date Thu, 15 Feb 2018 14:57:34 GMT
Hello again,

Could you please give me feedback on these questions?

Thank you in advance for your help.  Have a good day.

Best regards,

From: BELLIER Jean-luc
Sent: Wednesday, 14 February 2018 09:06
To: '' <>
Subject: RE: usage of the Kylin Web interface and performance

Hello again,

I have other questions concerning:

1.       the building of cubes, and the way to use them through the admin interface.

I created a cube from scratch based on the kylin_model. I wanted to use some derived dimensions
such as YEAR_BEG_DT, MONTH_BEG_DT and WEEK_BEG_DT. I also put CAL_DT from the KYLIN_CAL_DT table
and PART_DT from the KYLIN_SALES table.
I defined my measures. That is OK.

But now, when I define an aggregation group, I want to use YEAR_BEG_DT, MONTH_BEG_DT and WEEK_BEG_DT
as hierarchy dimensions, but I cannot, because I suppose they are defined as 'derived' and not as
normal dimensions. So I left this field blank, but when I save the cube, the system tells me that
there must be at least 2 fields in a hierarchy.
In the 'joint dimensions' section, what I understand is that I should list all the dimensions I want
to put in a "group by", whatever their type and nature (standard field, FK).
So I wonder what exactly to put in these aggregation groups.
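For reference, here is the kind of aggregation group definition I am attempting, as it appears in the cube descriptor JSON (a sketch based on the sample model; the column names come from KYLIN_CAL_DT and KYLIN_SALES, and if I understand correctly only non-derived dimensions can be listed here):

```json
{
  "aggregation_groups": [
    {
      "includes": ["YEAR_BEG_DT", "MONTH_BEG_DT", "WEEK_BEG_DT", "PART_DT", "LSTG_FORMAT_NAME"],
      "select_rule": {
        "mandatory_dims": [],
        "hierarchy_dims": [["YEAR_BEG_DT", "MONTH_BEG_DT", "WEEK_BEG_DT"]],
        "joint_dims": []
      }
    }
  ]
}
```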

Another question: when I go to the 'Insight' tab to run queries on the cube, I do not know
how to access the computed measures defined in the cube by running standard SQL queries. I assume this
information is not accessible here, unless I recompute it in the query.
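To be concrete, this is the kind of query I run in the 'Insight' tab (using the sample model's columns; my assumption is that the measure has to be re-expressed as an aggregate so that it can be matched against the cube):

```sql
-- If the cube defines a SUM measure on PRICE, I expect this
-- aggregate to be answered from the precomputed measure.
SELECT PART_DT, SUM(PRICE) AS TOTAL_SALES
FROM KYLIN_SALES
GROUP BY PART_DT;
```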

2.       The optimization of cube building
By default, MapReduce is used for cube generation, but I notice that you can use Spark. Which
is the most efficient in terms of cube generation (time, space used) and performance for the
final user?
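For context, this is how I understand the engine is selected; the property names below are what I found for Kylin 2.x and may not be exact for every version:

```properties
# kylin.properties (sketch, as I understand it for Kylin 2.x)
# 2 = MapReduce, 4 = Spark
kylin.engine.default=4
# Spark resources appear to be tuned via kylin.engine.spark-conf.* entries, e.g.:
kylin.engine.spark-conf.spark.executor.memory=4G
```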

3.       Data Storage
I noticed that information is also stored in HDFS under the /hbase folder. I presume this contains
all the cube data used and computed during the cube generation phase. Am I right? So what
kind of information is stored in HBase, and how do these two storage spaces communicate?
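For what it is worth, this is how I have been inspecting the two storage areas (the paths are the defaults from my installation and may differ on another cluster):

```shell
# Working files that Kylin writes into HDFS during a build
hdfs dfs -du -h /kylin

# Cube segments stored as HBase tables (KYLIN_* names)
echo "list 'KYLIN.*'" | hbase shell
```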

I know there are a lot of questions here and in my previous mails, but I really have to understand
how it works to check whether this tool can fit our business needs.

Thank you for your help. Have a good day.

Best regards,

From: BELLIER Jean-luc
Sent: Tuesday, 13 February 2018 18:50
To: '' <<>>
Subject: usage of the Kylin Web interface and performance


I have several questions on Kylin, especially about performance and how to manage it. I
would like to understand precisely how it works, to see whether I can use it in my business context.

I come from the relational database world, so as far as I understand OLAP, searches
are performed on the values of primary keys in the dimension tables. These subsets are then joined to
get the corresponding rows in the fact table. As the dimension tables are much smaller
than the fact table, the queries run faster.
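To make my mental model concrete, this is the kind of star-schema join I have in mind, written against the sample model (KYLIN_SALES joining KYLIN_CAL_DT on the date key):

```sql
SELECT d.YEAR_BEG_DT, SUM(f.PRICE) AS TOTAL_SALES
FROM KYLIN_SALES f
JOIN KYLIN_CAL_DT d ON f.PART_DT = d.CAL_DT
GROUP BY d.YEAR_BEG_DT;
```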

1.       Questions on performance

·         The raw data are stored in Hive, and the models and structures (cubes) are stored
in HBase; I presume that the whole .json files are stored, is that right?

·         Where are the cube results stored (I mean after a build, a refresh or an append
action)? Is it also in HBase? I can see tables in HBase like "KYLIN_FF46WDAAGH". Do these
kinds of tables contain the cube data?

·         I noticed that when I built the 'sample_cube', the volume of data was very large
compared to the size of the original files. Is there a way to reduce it? (I saw an attribute
called 'compression' in the $KYLIN_HOME/tomcat/conf/server.xml file, for the connector on
port 7070, but I do not know whether it is related to the cube size.) I tried to change this
parameter to 'yes', but I noticed a huge increase in the duration of cube generation, so I
wonder whether this is the right method.

·         How is it possible to optimize the cube size while keeping good performance?

·         In Hive, putting indexes is not recommended. So how does Kylin ensure good performance
when querying high volumes of data? Is it through the 'rowkeys' in the advanced settings
when you build the cube, or is the answer elsewhere?

2.       Questions on cube building

·         By the way, the 'Advanced settings' step is still unclear to me. I tried to build
a cube from scratch using the tables provided in the sample project, but I do not really
know what to put in this section.

·         My goal is to define groups of data on YEAR_BEG_DT, QTR_BEG_DT, MONTH_BEG_DT.

·         I do not understand very well why the aggregation group contains so many columns.
I tried to remove as many as possible, but when I tried to set up the joins, some fields
were missing, so saving the cube failed.

·         What exactly should we put in the 'Rowkeys' section? I understand that it is
used to define the data encoding (for speed of access?). Am I right?
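For illustration, this is the fragment I see in the cube descriptor for that section; 'dict' and 'date' are the encodings I found in the sample cube, and I assume the column order matters for scan performance:

```json
"rowkey": {
  "rowkey_columns": [
    { "column": "PART_DT",          "encoding": "date" },
    { "column": "LSTG_FORMAT_NAME", "encoding": "dict" }
  ]
}
```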

·         Are the aggregation groups used to speed up queries? I assume that is the case,
because they represent the most commonly used associations of columns for the cube.
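Trying to answer part of this myself: as far as I understand, each aggregation group prunes the number of precomputed cuboids. Here is a small sketch of the arithmetic as I understand it (my own approximation of the pruning rules, not Kylin's exact implementation):

```python
def cuboid_count(normal=0, hierarchy_sizes=(), joint_groups=0, mandatory=0):
    """Rough upper bound on the cuboids generated by one aggregation group.

    - a mandatory dimension appears in every cuboid (factor 1, so the
      `mandatory` count does not change the total)
    - each joint group is all-or-nothing (factor 2)
    - a hierarchy of size h allows only its h prefixes or absence (factor h + 1)
    - every other ("normal") dimension is freely present or absent (factor 2)
    """
    count = 2 ** (normal + joint_groups)
    for h in hierarchy_sizes:
        count *= h + 1
    return count

# 10 unconstrained dimensions -> 2**10 = 1024 combinations
print(cuboid_count(normal=10))

# Same 10 dimensions with YEAR/MONTH/WEEK declared as a hierarchy of 3:
# 2**7 * (3 + 1) = 512, i.e. half the cuboids to build and store
print(cuboid_count(normal=7, hierarchy_sizes=(3,)))
```

If this approximation is right, it would explain why declaring hierarchies and joints matters so much for build time and cube size.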

Thank you in advance for your help.

Best regards,

"Ce message est destiné exclusivement aux personnes ou entités auxquelles il est adressé
et peut contenir des informations privilégiées ou confidentielles. Si vous avez reçu ce
document par erreur, merci de nous l'indiquer par retour, de ne pas le transmettre et de procéder
à sa destruction.

This message is solely intended for the use of the individual or entity to which it is addressed
and may contain information that is privileged or confidential. If you have received this
communication by error, please notify us immediately by electronic mail, do not disclose it
and delete the original message."
