kylin-user mailing list archives

From ShaoFeng Shi <>
Subject Re: How to estimate resource cost according to data scale?
Date Thu, 16 Nov 2017 02:45:55 GMT
Hi Chase,

I see your Hadoop cluster is on AWS EMR; did you try EMR's auto-scaling rules? Kylin
builds the cube on Hadoop in parallel. When a big data set comes in, Hadoop
starts more tasks than usual; if there are many pending tasks, EMR can
detect that and add new task nodes. This should help improve overall
build performance, but it may not be as efficient as you expect (in meeting the 20-minute target).
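
As an illustration, an EMR auto-scaling rule that adds task nodes when many YARN containers are pending might look like the sketch below; the capacity limits, threshold, and adjustment size are assumptions you would tune for your cluster, and the policy would be attached with `aws emr put-auto-scaling-policy`:

```json
{
  "Constraints": { "MinCapacity": 2, "MaxCapacity": 10 },
  "Rules": [
    {
      "Name": "ScaleOutOnPendingContainers",
      "Action": {
        "SimpleScalingPolicyConfiguration": {
          "AdjustmentType": "CHANGE_IN_CAPACITY",
          "ScalingAdjustment": 2,
          "CoolDown": 300
        }
      },
      "Trigger": {
        "CloudWatchAlarmDefinition": {
          "ComparisonOperator": "GREATER_THAN",
          "EvaluationPeriods": 1,
          "MetricName": "ContainerPendingRatio",
          "Namespace": "AWS/ElasticMapReduce",
          "Period": 300,
          "Statistic": "AVERAGE",
          "Threshold": 0.75
        }
      }
    }
  ]
}
```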

Is it possible to forecast that a big data set is coming and then call the AWS API
to scale out the cluster in advance? Besides, which build engine are you using, MR or Spark?
Switching to Spark can further reduce the build time.
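
For reference, a hypothetical `kylin.properties` fragment enabling the Spark cubing engine in Kylin 2.x might look like the following; the property names and values here are assumptions, so please verify them against the documentation for your Kylin version:

```properties
# Make Spark the default cubing engine (2 = MapReduce, 4 = Spark)
kylin.engine.default=4

# Spark resources for the cubing job (illustrative values)
kylin.engine.spark-conf.spark.master=yarn
kylin.engine.spark-conf.spark.executor.memory=4G
kylin.engine.spark-conf.spark.executor.instances=8
```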

2017-11-14 16:29 GMT+08:00 Chase Zhang <>:

> Hi all,
> This is Chase from Strikingly. Recently we ran into a problem with our
> usage of Apache Kylin, described below. We're hoping someone here can
> offer some suggestions :)
> The problem is estimating the resource and time cost of one cube build
> in proportion to the data scale.
> Currently we have a task that is triggered once per hour, and the cube
> build takes 7-10 minutes on average. As our business grows, we need to
> plan a scale-up of our data platform before the build time becomes too
> long.
> Thus, we're wondering whether there is a good way to forecast the
> resources required to keep the same task's build time under 20 minutes
> if the data scale grows, for example, 100 times. As we are not familiar
> with Kylin's underlying algorithms, we're not sure how Kylin will
> actually perform on our dataset.
> Does the development team or anyone else in the community have any
> experience or suggestions on this? Are there any articles on this
> specific problem?
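
On the forecasting question, absent a formal cost model, one rough empirical approach is to fit historical build times against source row counts and extrapolate to the larger scale; a minimal sketch in Python, with all sample numbers invented for illustration:

```python
# Fit a power-law build-time model t = a * n**b from historical
# (source_rows, build_minutes) measurements, then extrapolate to a
# 100x data scale. The history below is invented for illustration.
import numpy as np

history = [  # (source rows, observed build minutes)
    (1_000_000, 7.0),
    (2_000_000, 8.5),
    (4_000_000, 11.0),
    (8_000_000, 15.5),
]

rows = np.array([r for r, _ in history], dtype=float)
minutes = np.array([m for _, m in history], dtype=float)

# Linear least-squares fit in log-log space: log t = log a + b * log n
b, log_a = np.polyfit(np.log(rows), np.log(minutes), 1)

def predict_minutes(n_rows: float) -> float:
    """Extrapolated build time for n_rows source rows."""
    return float(np.exp(log_a) * n_rows ** b)

current = predict_minutes(1_000_000)
scaled = predict_minutes(100_000_000)  # 100x the current scale
print(f"fitted exponent b = {b:.2f}")
print(f"predicted build at 100x: {scaled:.1f} min (vs {current:.1f} now)")
```

The power-law form is only an assumption; measuring a few builds at intermediate scales (e.g. 2x, 10x) would tell you whether growth is closer to linear or sub-linear before committing to a capacity plan.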

Best regards,

Shaofeng Shi 史少锋
