kylin-dev mailing list archives

From Li Yang <liy...@apache.org>
Subject Re: proposal of cube building optimization
Date Tue, 03 Mar 2015 07:35:23 GMT
This proposal is the same as https://issues.apache.org/jira/browse/KYLIN-607
that I created earlier.

@宋轶, the difference from our very first POC is that here the mapper outputs
the aggregated result of a small chunk of records (the KVs of a micro
segment), not the raw records themselves.

In the ideal case, the solution achieves 1 * [Total Cube Size] of
shuffling. This happens when there's a mandatory dimension and each mapper
takes a different slice of that dimension. E.g. month is mandatory and each
mapper is assigned a different month of data. Then no two mappers emit the
same key, so nothing is duplicated and the shuffle size is optimal.

Of course, in the worst case, the shuffle size might be several times the
current one. So it really depends on the data set and the aggregation
config. What we are seeing now is that, more often than not, date/time is a
mandatory column, and when that's true, the new method will have an edge.
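
To make the two cases concrete, here is a minimal Python sketch. It is not
Kylin code: the toy records, the two cuboids, and the function names are all
assumptions made up for illustration. Each "mapper" aggregates its chunk in
memory and emits one KV per (cuboid, key); the shuffle size is the total
number of KVs emitted, and the cube size is the number of distinct keys.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy data: (month, city, amount), 5 records per combination.
MONTHS = ["2015-01", "2015-02", "2015-03"]
CITIES = ["SH", "BJ", "SZ", "HZ"]
RECORDS = [(m, c, 1) for m, c in product(MONTHS, CITIES) for _ in range(5)]

# Cuboids as tuples of dimension indices. Month (index 0) is mandatory,
# so it appears in every cuboid.
CUBOIDS = [(0, 1), (0,)]


def map_chunk(chunk):
    """Aggregate one mapper's chunk in memory, one entry per cuboid key."""
    agg = defaultdict(int)
    for rec in chunk:
        for cuboid in CUBOIDS:
            key = (cuboid, tuple(rec[i] for i in cuboid))
            agg[key] += rec[2]
    return agg


def shuffle_and_cube_size(chunks):
    """Return (KVs emitted by all mappers, distinct KVs in the final cube)."""
    outputs = [map_chunk(chunk) for chunk in chunks]
    shuffled = sum(len(out) for out in outputs)
    cube = set().union(*(out.keys() for out in outputs))
    return shuffled, len(cube)


# Ideal case: each mapper gets a different month.
by_month = [[r for r in RECORDS if r[0] == m] for m in MONTHS]
# Unfavourable case: every mapper sees a mix of all months.
round_robin = [RECORDS[i::3] for i in range(3)]

print(shuffle_and_cube_size(by_month))     # (15, 15)
print(shuffle_and_cube_size(round_robin))  # (45, 15)
```

With the month split, the mappers' outputs do not overlap and the shuffle
equals the cube size. With the round-robin split, every mapper emits every
key, so this toy run shuffles three times the cube size.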

Cheers
Yang



On Mon, Mar 2, 2015 at 2:13 PM, 蒋旭 <jiangxu.china@qq.com> wrote:

> 1. One-step building is better suited to incremental builds, which have a
> small data size. A full build on a large data set can still use
> multi-stage building.
>
>
> 2. Since the mapper manages memory by itself, it will cache as much of the
> intermediate result in memory as possible. Moreover, the mapper will do
> pre-aggregation in memory, just like a combiner. This should reduce the
> shuffle data size.
>
>
> 3. Since it's one-step building, the data read size and the job scheduling
> latency should be much lower.
>
>
> Thanks
> Jiang Xu
>
>
> ------------------ Original Message ------------------
> From: Ted Dunning <ted.dunning@gmail.com>
> Sent: 2015-03-02 13:52
> To: dev <dev@kylin.incubator.apache.org>
> Subject: Re: proposal of cube building optimization
>
>
>
> On Mon, Mar 2, 2015 at 6:47 AM, 宋轶 <yi.song@outlook.com> wrote:
>
> > The problem with it is that each mapper will generate too much
> > intermediate data, and the network will be the bottleneck in the shuffle
> > phase
>
>
> This would prevent multiple passes over the input data.  Is there a
> difference in the amount of shuffled data from the amount that would be
> shuffled by multiple map-reduce steps?
>
