kylin-user mailing list archives

From "Chao Long" <wayn...@qq.com>
Subject Re: Re: Kylin w/ Spark - Build 626min - Steps 1/2/3 455min - Steps 4-8 - 171min
Date Fri, 21 Dec 2018 07:06:43 GMT
An even distribution means the data is not skewed. If data skew occurs, some tasks may take
much longer than the average task. The RedistributeFlatHiveTableStep exists to avoid data
skew as far as possible. For more details, see:
https://issues.apache.org/jira/browse/KYLIN-1656
https://issues.apache.org/jira/browse/KYLIN-1677


The parameter "kylin.engine.mr.uhc-reducer-count" applies to both MapReduce and Spark. In Spark,
a larger value means more tasks are allocated. As for what value to use, I think you can look at
the task execution state of the "Extract Fact Table Distinct Columns" job in the Spark UI, identify
the most time-consuming task, and give this parameter a suitable value. As for exactly what that
value should be, I don't know.
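
One rough way to judge skew from the Spark UI is to compare the slowest task with the median task. The sketch below is only an illustration: the function name and the sample durations are hypothetical, and the durations are assumed to be copied by hand from the task list of the "Extract Fact Table Distinct Columns" job.

```python
from statistics import median


def skew_ratio(task_seconds):
    """Return the slowest task's duration divided by the median duration."""
    if not task_seconds:
        raise ValueError("need at least one task duration")
    mid = median(task_seconds)
    if mid <= 0:
        raise ValueError("median duration must be positive")
    return max(task_seconds) / mid


# Hypothetical task durations in seconds (not taken from this thread).
even = [60, 62, 58, 61, 59]
skewed = [60, 62, 58, 61, 900]

print(round(skew_ratio(even), 2))    # 1.03
print(round(skew_ratio(skewed), 2))  # 14.75
```

A ratio close to 1 suggests an even distribution. A ratio far above 1 suggests that one task is handling much more data than the rest, which is the case where a larger "kylin.engine.mr.uhc-reducer-count" may be worth trying.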



------------------
Best Regards,
Chao Long


------------------ Original Message ------------------
From: "Jon Shoberg"<jon.shoberg@gmail.com>;
Sent: Friday, December 21, 2018, 10:34 AM
To: "user"<user@kylin.apache.org>;

Subject: Re: Re: Kylin w/ Spark - Build 626min - Steps 1/2/3 455min - Steps 4-8 - 171min



That’s great to know about step 2!

How would you define or determine an even distribution? This is a four-node HDFS cluster, and
the bz2 files used as the data source (external table) have a dfs distribution of 2. I’d imagine
the distribution would not be horrible on a small cluster.


On the reducer count: this is a Spark setup, so on YARN I see this step running as a Spark
job. Does a MapReduce setting such as this apply? If so, what is a larger value? I think the
default here is 1 ... should it be 2, 5, 10, or 100? It’s a 4-node cluster with 10 CPUs and
~550 GB RAM.

Sent from my iPhoneX

On Dec 20, 2018, at 7:24 PM, Chao Long <wayne.l@qq.com> wrote:


Hi,
  If the data has an even distribution, you can set "kylin.source.hive.redistribute-flat-table=false"
to skip Step 2. As for Step 3, if you have many UHC (ultra-high-cardinality) dimensions, you can set
"kylin.engine.mr.uhc-reducer-count" to a larger value so that more reducers are used to build the dictionaries.
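
As a sketch, the two settings might look like this in kylin.properties (or as cube-level configuration overrides). The property names come from this message; the reducer count of 5 is only an assumed starting point and should be tuned against the task times seen for Step 3.

```properties
# Skip Step 2 (Redistribute Flat Hive Table); only sensible if the data is evenly distributed.
kylin.source.hive.redistribute-flat-table=false

# Use more reducers for UHC dimensions in Step 3. The value 5 is an assumption, not a recommendation.
kylin.engine.mr.uhc-reducer-count=5
```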


------------------
Best Regards,
Chao Long


------------------ Original Message ------------------
From: "Jon Shoberg"<jon.shoberg@gmail.com>;
Sent: Thursday, December 20, 2018, 10:20 PM
To: "user"<user@kylin.apache.org>;

Subject: Kylin w/ Spark - Build 626min - Steps 1/2/3 455min - Steps 4-8 - 171min



Question ...

  Is there a way to optimize the first three steps of a Kylin build?


  The total build time of a development cube is 626 minutes. Here is the breakdown by step:

 87 min - Create Intermediate Flat Hive Table
207 min - Redistribute Flat Hive Table
248 min - Extract Fact Table Distinct Columns
  0 min
  0 min
 62 min - Build Cube with Spark
 19 min - Convert Cuboid Data to HFile
  0 min
  0 min
  0 min
  0 min

   The data set is summary files (~35M records) and detail files (~4B records - 40GB compressed).


   A join is needed to produce the final data, and it is handled in a view within Hive. So I
do expect a performance cost there.


   However, staging the data in other ways (loading to sequence/ORC files vs. an external table
over bz2 files) gives no net gain.


   This means that pre-processing the data externally can make Kylin run a little faster, but the
overall time from absolute start to finish is still ~600min.


   Steps 1/2 seem to be a redundancy given how my data is structured; the hsql/sql commands
Kylin sends to Hive could be done before the build process.


   Is it possible to optimize steps 1/2/3? Is it possible to skip steps 1/2 and jump to step
3 if the data was staged as-needed/correctly beforehand?


   My guess is that the answers here are mostly 'no' (which is fine), but I thought I'd ask.


   (The test lab is getting doubled in size today, so I'm not ultimately worried, but I'm looking
for other improvements beyond just adding hardware and networking.)


Thanks! J