kylin-user mailing list archives

From ShaoFeng Shi <shaofeng...@apache.org>
Subject Re: [DISCUSSION] Don't need to purge existing segment of cube to add new measures in Kylin
Date Fri, 26 Apr 2019 08:56:18 GMT
Hi Yuzhang,

Please open a JIRA for this enhancement. If it can be implemented in an
elegant way, that would be great!

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Email: shaofengshi@apache.org

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org




On Tue, Apr 23, 2019 at 8:56 AM, yuzhang <shifengdefannao@163.com> wrote:

> Hi Shaofeng,
>     We also ran some experiments on adding a measure after the cube had been
> built, and hit a byte error at the very start. The default mapping strategy
> between the HBase store and the measure definitions is "multiple measures
> are stored in one column of a column family", which can cause a byte error
> once a measure is added and inserted into the original measure sequence.
> Adding a separate column for the new measure may be better, I think.
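
The byte error described above can be sketched as follows. This is only a minimal illustration, not Kylin's actual serializer: the fixed-width `CODECS` table, the measure names, and all function names are assumptions made for the example. It packs measures back to back into one cell, shows that an old cell cannot be decoded once a measure has been inserted mid-sequence in the schema, and contrasts that with one column per measure, where an absent column simply reads as `None`.

```python
import struct

# Hypothetical fixed-width codecs; Kylin's real measure serializers differ.
CODECS = {"count": "q", "sum_price": "d", "max_qty": "q"}


def encode_packed(measures, values):
    """Pack all measure values back to back into one cell, in schema order."""
    fmt = ">" + "".join(CODECS[m] for m in measures)
    return struct.pack(fmt, *(values[m] for m in measures))


def decode_packed(measures, cell):
    """Decode one cell, assuming it was written with exactly this schema."""
    fmt = ">" + "".join(CODECS[m] for m in measures)
    return dict(zip(measures, struct.unpack(fmt, cell)))


def encode_columns(measures, values):
    """Alternative layout: one cell (column) per measure."""
    return {m: struct.pack(">" + CODECS[m], values[m]) for m in measures}


def decode_columns(measures, cells):
    """A measure whose column is absent in an old segment decodes to None."""
    return {
        m: struct.unpack(">" + CODECS[m], cells[m])[0] if m in cells else None
        for m in measures
    }


old_schema = ["count", "sum_price"]
new_schema = ["count", "max_qty", "sum_price"]  # max_qty inserted mid-sequence
row = {"count": 3, "sum_price": 9.5}

old_cell = encode_packed(old_schema, row)
try:
    decode_packed(new_schema, old_cell)
    packed_result = "decoded"
except struct.error as exc:
    # The old cell is shorter than the new schema expects.
    packed_result = f"byte error: {exc}"

column_result = decode_columns(new_schema, encode_columns(old_schema, row))
print(packed_result)
print(column_result)
```

With the packed layout the reader has no way to tell which bytes belong to which measure, so any schema change invalidates old cells; with one column per measure, the column name carries that information.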
>
>     I only have a preliminary idea about the measure management design, and
> it may be impractical for now.
> Dimensions and metrics are defined once the model is designed. Measures
> aggregate the metrics over different dimensions to observe the data entities
> represented by the model. All of this belongs to the design of the 'logical
> view', I think. The cube is a materialized view of this logical model; it is
> the bridge between the logical view and the physical storage (and the
> highway is already set up). The life cycle of a measure may therefore depend
> on the model rather than on the cube.
>
>     Based on this design, measure management can be set up after the model
> design is completed. We can define measures on the model. Cubes under the
> model can reuse those measures and build their segment data. When a SQL
> query arrives, the Kylin query server needs to find a suitable model with
> suitable measures, and then find an available cube.
>
>     Of course, such a design change would have a very large impact on the
> existing Kylin architecture, and both the query path and the metadata would
> change substantially. So it is still only on paper.
>     A more realistic or transitional design is to extend the metadata of
> the measure. Just as CubeDesc defines the schema and a related CubeInstance
> manages the built segments, MeasureDesc could also have a MeasureInstance
> to manage the segments that contain it.
> I observed that Kylin's query service generates a GridTable for the mapping
> between logical views and HBase physical storage: Cuboid + Measure ->
> GridTable <- HBase store. This GridTable is generated from CubeDesc, and the
> mapping process runs for each segment. Therefore, in the mapping stage, the
> metadata can tell which GridTable columns are not available in the current
> segment, so the measure data can be read selectively at the RS backend.
> But its life cycle is the same as that of MeasureDesc, managed by CubeDesc.
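
A rough sketch of the transitional design above, under stated assumptions: `MeasureInstance` here is a hypothetical Python stand-in for the proposed metadata (it is not an existing Kylin class), and `plan_segment_read` is an invented helper showing how the mapping stage could decide, per segment, which measures to read and which are unavailable.

```python
from dataclasses import dataclass, field


@dataclass
class MeasureInstance:
    """Hypothetical metadata: which built segments contain a given measure."""
    name: str
    segments: set = field(default_factory=set)


def plan_segment_read(requested, instances, segment):
    """Split requested measures into those readable from `segment` and those missing."""
    readable = [m for m in requested if segment in instances[m].segments]
    missing = [m for m in requested if segment not in instances[m].segments]
    return readable, missing


instances = {
    "m1": MeasureInstance("m1", {"seg_tr1", "seg_tr2"}),
    "m2": MeasureInstance("m2", {"seg_tr2"}),  # added after seg_tr1 was built
}

plan_tr1 = plan_segment_read(["m1", "m2"], instances, "seg_tr1")
plan_tr2 = plan_segment_read(["m1", "m2"], instances, "seg_tr2")
print(plan_tr1)
print(plan_tr2)
```

Only the readable measures would then be requested from the storage backend for that segment; the missing ones could be reported back to the user.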
>
>     Regarding adding dimensions to the same cube, we also need to consider
> aggregation groups and rowkey order. I am curious and interested in how you
> implemented it.
>
>
>
>                                                               Best regards
>
>
>                                                               yuzhang
>
> yuzhang
> shifengdefannao@163.com
>
> On 4/22/2019 09:05, ShaoFeng Shi <shaofengshi@apache.org> wrote:
>
> Hi Yuzhang,
>
> Glad to see such a discussion. Supporting "schema change" in a friendly way
> is what we should work on in the next phase, as this requirement is now
> stronger than before.
>
> Last week I also tried 1) adding a dimension after the cube has been built,
> and 2) adding a measure after the cube has been built.
>
> For 1) I have an idea, and the first try was successful; I want to discuss
> it with the community some day.
>
> 2) failed: after a new measure was added, the query failed and there was a
> byte parsing error on the HBase RS side. I didn't continue after that.
>
> Could you elaborate on your idea that "the measures of the analysis system
> can be decoupled from the materialized view(cube) and have their own
> management system"? Do you have a rough design for it? Thank you!
>
> Best regards,
>
> Shaofeng Shi 史少锋
> Apache Kylin PMC
> Email: shaofengshi@apache.org
>
>
>
>
> On Sun, Apr 21, 2019 at 8:08 PM, yuzhang <shifengdefannao@163.com> wrote:
>
> Hi JiaTao,
> Perhaps an optional auto-complete mechanism is needed to reconcile the
> segments' different views of the measures, isn't it?
>
>
> yuzhang
>
>
> On 4/20/2019 11:38, JiaTao Tao <taojiatao@gmail.com> wrote:
> Hi
>
> The idea of letting Kylin add measures dynamically is impressive.
>
> But in my opinion, once you add a measure, the existing segments should
> also calculate the new measure (just add a new measure column). Users can
> have many cubes, and a cube can have many segments; if the view of the
> measures differs from segment to segment, it will increase the burden on
> the user.
>
> --
>
>
> Regards!
>
> Aron Tao
>
> On Sat, Apr 20, 2019 at 1:43 AM, yuzhang <shifengdefannao@163.com> wrote:
>
> Hi dear Kylin users and development team,
> There are some things I would like to discuss with the community.
> As a representative MOLAP engine, Kylin uses pre-aggregation to provide
> high-concurrency analysis with response times on the order of seconds, but
> it also loses some flexibility.
> The requirement to purge existing segments before adding a measure causes
> a lot of repeated computation and unnecessary disk I/O. Such waste should be
> avoided, especially in a MOLAP engine.
> For example, there is a cube, cubeA, with one measure m1 and segments over
> time range 1 (tr1). Now the user adds a measure m2 but doesn't want to clear
> the segments over tr1. The values of m2 will exist in tr2, the segments
> built subsequently. Of course, tr1 doesn't contain values of m2, which will
> be understood by users who know a little about MOLAP. Querying over tr1 and
> tr2 is valid for both m1 and m2, but the result of m2 over tr1 will be null.
> It would be better to remind the user that the measure is missing. Moreover,
> refreshing will supply m2 to the segments over tr1.
> Currently, Kylin's storage engine is HBase. Measures are values aggregated
> over combinations of dimension members and stored in a column of a column
> family in HBase. For the same cube, adding a new measure will add a column
> to the HBase table (mapping) and will take effect in the next build. For the
> existing HTables (segments), the new column is allowed to be missing.
> Refreshing old segments will add the new column to their HTables to store
> the new measure. The value of the new measure is aggregated according to
> the combination of dimension members in the rowkey, without recalculating
> the existing measures.
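
The proposed behavior (null for m2 over tr1, a reminder about the missing measure, and a refresh that adds only the new column) can be sketched as below. This is an in-memory illustration under assumptions, not Kylin code: segments are plain dictionaries standing in for HTables, every measure is a simple sum, and `build_segment`, `refresh_segment`, and `query` are hypothetical names.

```python
# Hypothetical in-memory stand-ins for segments; a real segment is an HTable.
def build_segment(rows, measures):
    """Aggregate the given measures (sum of a source field) over `rows`."""
    return {name: sum(r[src] for r in rows) for name, src in measures.items()}


def refresh_segment(segment, rows, new_measures):
    """Add only the missing measures; existing values are left untouched."""
    for name, src in new_measures.items():
        if name not in segment:
            segment[name] = sum(r[src] for r in rows)
    return segment


def query(segments, measure):
    """Sum `measure` across segments; report segments where it is missing."""
    total, missing = None, []
    for seg_name, seg in segments.items():
        if measure not in seg:
            missing.append(seg_name)
        else:
            total = seg[measure] + (total or 0)
    return total, missing


rows_tr1 = [{"price": 10, "qty": 1}, {"price": 20, "qty": 2}]
rows_tr2 = [{"price": 5, "qty": 4}]

segments = {
    "tr1": build_segment(rows_tr1, {"m1": "price"}),               # built before m2
    "tr2": build_segment(rows_tr2, {"m1": "price", "m2": "qty"}),  # built after m2
}

tr1_only = query({"tr1": segments["tr1"]}, "m2")  # m2 over tr1 alone is null
before = query(segments, "m2")                    # tr1 is reported as missing m2
refresh_segment(segments["tr1"], rows_tr1, {"m2": "qty"})
after = query(segments, "m2")
print(tr1_only, before, after)
```

The list of missing segments returned by `query` is what a reminder to the user could be built from.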
> Now, for additional measures and even additional dimensions, Kylin's
> current solution is Hybrid, but we found the following shortcomings in
> use:
> 1. Management costs: similar cubes must be maintained repeatedly, and most
> of them share many dimensions and indicators. If you want to perform
> optimizations such as pruning, you need to configure all of these cubes.
> 2. A large number of cubes: in the early stage, the analysis of the
> business is not stable, and analysts often need to add measures. Cubes are
> continuously added to the Hybrid group, which produces a lot of cubes.
> 3. Repeated calculation: if you want to drop the old cube in the Hybrid
> group, you need to build the latest cube by computing historical data to
> cover the old cube.
> All of this results in a lot of waste.
> In addition, while using Kylin, I felt that the metadata about measures
> was incomplete.
> 1. Measures are among the most important concerns of analysts. If the
> measures of the analysis system can be decoupled from the materialized
> view(cube) and have their own management system, it may be more flexible.
> 2. Once the dimensions have been chosen in cube design, the cube's cuboids
> are fixed regardless of the number of measures. It can be confusing to
> maintain cubes with different measures but the same cuboids. Only cubes
> with different cuboids should be considered different cubes, which is the
> definition of a cube, isn't it?
> These are just some thoughts about MOLAP from my use of Kylin. What do you
> think? I sincerely look forward to your reply.
> There may be mistakes or misunderstandings here; please feel free to
> correct me or discuss further if you find any.
> Best regards
> yuzhang
>
>
> yuzhang
> shifengdefannao@163.com
>
