kylin-user mailing list archives

From "Xiaoxiang Yu" <>
Subject Re: Re: Kylin streaming questions
Date Tue, 25 Jun 2019 15:42:12 GMT

Hi, Andras
    I am glad to see that you have a strong understanding of Kylin's Realtime OLAP.
Most of your conclusions are correct; the following is my understanding:
    1)  Currently, there is no documentation on how to use lambda mode;
we will publish some after the 3.0.0-beta release (maybe this weekend or a week later?).
    2)  The Hive table must have the same name as the streaming table, and should be located in
the "default" namespace of Hive. The column names should match exactly and data type should be
    3)  If you want to build a segment from data in Hive, you have to build it through the REST API.
    4)  The cube build engine must be MapReduce; Spark is not supported at the moment.
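
As a rough illustration of point 3, here is a minimal Python sketch that prepares a segment build request. It assumes the `PUT /kylin/api/cubes/{cubeName}/build` endpoint with `startTime`, `endTime` and `buildType` fields as described in the Kylin REST API documentation; the host, port, cube name, credentials and time range are hypothetical placeholders, and the request is only constructed here, not sent.

```python
import base64
import json
import urllib.request


def make_build_request(base_url, cube_name, start_ms, end_ms, user, password):
    """Prepare (but do not send) a request asking Kylin to build one segment."""
    body = json.dumps({
        "startTime": start_ms,  # segment start, epoch milliseconds
        "endTime": end_ms,      # segment end, epoch milliseconds
        "buildType": "BUILD",
    }).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/kylin/api/cubes/{cube_name}/build",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    req = make_build_request(
        "http://localhost:7070", "my_lambda_cube",
        1561334400000, 1561420800000, "ADMIN", "KYLIN",
    )
    print(req.get_method(), req.full_url)
    # To actually submit the job: urllib.request.urlopen(req)
```

Keeping `endTime` at or before the start of the data that should still come from Kafka avoids hiding the streaming data, as discussed in the quoted message below.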

Best wishes to you!
From :Xiaoxiang Yu

At 2019-06-25 17:20:55, "Andras Nagy" <> wrote:

Hi ShaoFeng,

Thanks a lot for the pointer on the lambda mode, yes, that's exactly what I need :)

Is there perhaps documentation on this? For now, I was trying to get this working 'empirically'
and finally succeeded, but some of my conclusions may be wrong. This is what I concluded:

- hive table must have the same name as the streaming table (name given to the data source)
- cube can't be built from UI (to build the historic segments from the data in hive), but
it can be built using the REST API
- cube build engine must be mapreduce. For Spark as build engine I got exception "Cannot adapt
to interface org.apache.kylin.engine.spark.ISparkOutput"
- endTime must be non-overlapping with the streaming data. When I had overlap, the streaming
data coming from kafka did not show up in the output, I guess this is what you meant by "the
segments from Hive will overwrite the segments from Kafka".

Are these correct conclusions? Is there anything else I should be aware of?

Many thanks,

On Tue, Jun 25, 2019 at 9:19 AM ShaoFeng Shi <> wrote:

Hello Andras,

Kylin's realtime-OLAP feature supports a "Lambda" mode (mentioned in,
which means you can define a fact table whose data can come from both Kafka and Hive. The only
requirement is that all the cube columns appear in both the Kafka data and the Hive data. I think
that may fit your need. The cube can be built from Kafka; meanwhile, it can also
be built from Hive, and the segments from Hive will overwrite the segments from Kafka (as Hive
data is usually more accurate). When querying the cube, Kylin will first query the historical segments,
and then the real-time segments (adding the max time of the historical segments as the condition).
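
The query behaviour described above can be pictured with a small Python sketch. This is only a toy model under stated assumptions, not Kylin's actual implementation: the row shape, the function name `merge_lambda_results`, and the choice of an exclusive end boundary for the historical segments are all hypothetical.

```python
from typing import List, Tuple

Row = Tuple[int, str]  # (event time in epoch ms, payload); hypothetical row shape


def merge_lambda_results(historical: List[Row], historical_end: int,
                         realtime: List[Row]) -> List[Row]:
    """Combine rows the way the lambda-mode description suggests."""
    # Historical (Hive-built) segments answer everything before historical_end.
    from_hive = [r for r in historical if r[0] < historical_end]
    # Real-time (Kafka-built) segments only contribute rows at or after it.
    from_kafka = [r for r in realtime if r[0] >= historical_end]
    return sorted(from_hive + from_kafka)


if __name__ == "__main__":
    hive = [(100, "hive-a"), (200, "hive-b")]
    kafka = [(200, "kafka-b"), (300, "kafka-c")]
    # Hive segment ends at 300: the overlapping Kafka row at 200 is hidden.
    print(merge_lambda_results(hive, 300, kafka))
    # Hive segment ends at 400: even the Kafka row at 300 is hidden.
    print(merge_lambda_results(hive, 400, kafka))
```

In this model, pushing the end time of a Hive-built segment past data that exists only in Kafka makes that streaming data disappear from query results, which matches the overlap behaviour reported earlier in the thread.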

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC

On Mon, Jun 24, 2019 at 11:29 PM Andras Nagy <> wrote:

Dear Ma,

Thanks for your reply.

Slightly related to my original question on the hybrid model, I was wondering if it's possible
to combine a batch and a streaming cube. I realized this is not possible, as a hybrid model
can only be created from cubes of the same model (and a model points to either a batch or
a streaming datasource).

The use case would be this:
- we have a large amount of streaming data in Kafka that we would like to process with Kylin
- Kafka retention is only a few days, so if we need to change anything in the cubes (e.g.
introduce a new metric or dimension which has been present in the events, but not in the cube
definition), we can only reprocess a few days' worth of data in the streaming model
- the raw events are also written to a data lake for long-term storage
- the data written to the data lake could be used to feed the historic data into a batch kylin
model (and cubes)
- I'm looking for a way to combine these, so if we want to change anything in the cubes, we
can recalculate them for the historic data as well

Is there a way to achieve this with current Kylin? (Without implementing a custom query layer
that combines the two cubes.)

Best regards,

On Fri, Jun 14, 2019 at 6:43 AM Ma Gang <> wrote:

Hi Andras,

Currently it doesn't support consuming from specified offsets; it only supports consuming from startOffset
or latestOffset. If you want to consume from startOffset, you need to set the configuration: to false in the cube's overrides page.

If you do need to start from specified offsets, please create a JIRA request, but I think
it is hard for a user to know what offsets should be set for all partitions.

At 2019-06-13 22:34:59, "Andras Nagy" <> wrote:

Dear Ma,

Thank you very much!

>1)yes, you can specify a configuration in the new cube, to consume data from start offset
That is, an offset value for each partition of the topic? That would be good - could you please
point me where to do this in practice, or point me to what I should read? (I haven't found
it on the cube designer UI - perhaps this is something that's only available on the API?)

Many thanks,

On Thu, Jun 13, 2019 at 1:14 PM Ma Gang <> wrote:

Hi Andras,
1) Yes, you can specify a configuration in the new cube to consume data from the start offset.

2) It should work, but I haven't tested it yet.

3) As I remember, we currently use the Kafka 1.0 client library, so it is better to use that version
or later. I'm sure that versions before 0.9.0 cannot work, but I'm not sure whether 0.9.x can work.

Ma Gang


On 06/13/2019 18:01, Andras Nagy wrote:

I have a few questions related to the new streaming (real-time OLAP) implementation.

1) Is there a way to have data reprocessed from kafka? E.g. I change a cube definition and
drop the cube (or add a new cube definition) and want to have data that is still available
on kafka to be reprocessed to build the changed cube (or new cube)? Is this possible?

2) Does the hybrid model work with streaming cubes (to combine two cubes)?

3) What is the minimum Kafka version required? The tutorial asks to install Kafka 1.0; is this
the minimum required version?

Thank you very much,
