kylin-user mailing list archives

From: Andras Nagy <andras.istvan.n...@gmail.com>
Subject: Re: Re: Kylin streaming questions
Date: Tue, 25 Jun 2019 09:20:55 GMT
Hi ShaoFeng,

Thanks a lot for the pointer on the lambda mode, yes, that's exactly what I
need :)

Is there perhaps documentation on this? So far I have been trying to get
this working 'empirically' and finally succeeded, but some of my conclusions
may be wrong. This is what I concluded:

- the Hive table must have the same name as the streaming table (the name
given to the data source)
- the cube can't be built from the UI (to build the historic segments from
the data in Hive), but it can be built using the REST API (see the example
call after this list)
- the cube build engine must be MapReduce. With Spark as the build engine I
got the exception "Cannot adapt to interface
org.apache.kylin.engine.spark.ISparkOutput"
- endTime must be non-overlapping with the streaming data. When I had
overlap, the streaming data coming from Kafka did not show up in the
output; I guess this is what you meant by "the segments from Hive will
overwrite the segments from Kafka".
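
For reference, roughly the REST call I used to build a historical segment
(a sketch, not the exact command; the host, cube name, and credentials are
placeholders with the ADMIN:KYLIN defaults, startTime/endTime are epoch
milliseconds, and endTime is chosen so it does not overlap the streaming
data):

  curl -X PUT --user ADMIN:KYLIN \
    -H "Content-Type: application/json;charset=utf-8" \
    -d '{"startTime": 0, "endTime": 1561334400000, "buildType": "BUILD"}' \
    http://localhost:7070/kylin/api/cubes/my_streaming_cube/build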

Are these correct conclusions? Is there anything else I should be aware of?

Many thanks,
Andras

On Tue, Jun 25, 2019 at 9:19 AM ShaoFeng Shi <shaofengshi@apache.org> wrote:

> Hello Andras,
>
> Kylin's realtime-OLAP feature supports a "Lambda" mode (mentioned in
> https://kylin.apache.org/blog/2019/04/12/rt-streaming-design/), which
> means you can define a fact table whose data comes from both Kafka and
> Hive. The only requirement is that all the cube columns appear in both the
> Kafka data and the Hive data. I think that may fit your need. The cube can
> be built from Kafka and, in the meantime, also from Hive; the segments
> from Hive will overwrite the segments from Kafka (as Hive data is usually
> more accurate). When querying the cube, Kylin will first query the
> historical segments, and then the real-time segments (adding the max time
> of the historical segments as a condition).
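>
> For illustration (a sketch of the behaviour, not the exact internal
> rewrite; the table, column, and timestamps are placeholders): if the
> historical segments cover up to 2019-06-24 00:00, a query like
>
>   SELECT COUNT(*) FROM EVENTS WHERE EVENT_TIME >= '2019-06-20'
>
> is answered from the historical segments for EVENT_TIME up to
> 2019-06-24 00:00, and from the real-time segments only beyond that point.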
>
>
> Best regards,
>
> Shaofeng Shi 史少锋
> Apache Kylin PMC
> Email: shaofengshi@apache.org
>
> Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
> Join Kylin user mail group: user-subscribe@kylin.apache.org
> Join Kylin dev mail group: dev-subscribe@kylin.apache.org
>
>
>
>
> Andras Nagy <andras.istvan.nagy@gmail.com> wrote on Mon, Jun 24, 2019 at 11:29 PM:
>
>> Dear Ma,
>>
>> Thanks for your reply.
>>
>> Slightly related to my original question on the hybrid model, I was
>> wondering if it's possible to combine a batch cube and a streaming cube. I
>> realized this is not possible, as a hybrid model can only be created from
>> cubes of the same model (and a model points to either a batch or a
>> streaming data source).
>>
>> The use case would be this:
>> - we have a large amount of streaming data in Kafka that we would like to
>> process with Kylin streaming
>> - Kafka retention is only a few days, so if we need to change anything in
>> the cubes (e.g. introduce a new metric or dimension which has been present
>> in the events, but not in the cube definition), we can only reprocess a few
>> days' worth of data in the streaming model
>> - the raw events are also written to a data lake for long-term storage
>> - the data written to the data lake could be used to feed the historic
>> data into a batch Kylin model (and cubes)
>> - I'm looking for a way to combine these, so if we want to change anything
>> in the cubes, we can recalculate them for the historic data as well
>>
>> Is there a way to achieve this with current Kylin? (Without implementing
>> a custom query layer that combines the two cubes.)
>>
>> Best regards,
>> Andras
>>
>> On Fri, Jun 14, 2019 at 6:43 AM Ma Gang <mg4work@163.com> wrote:
>>
>>> Hi Andras,
>>>
>>> Currently it doesn't support consuming from specified offsets; it only
>>> supports consuming from the start offset or the latest offset. If you
>>> want to consume from the start offset, you need to set the configuration
>>> kylin.stream.consume.offsets.latest to false on the cube's overrides
>>> page.
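>>>
>>> For example, as a key-value pair on the cube's "Configuration Overrides"
>>> page (a sketch of the setting mentioned above):
>>>
>>>   kylin.stream.consume.offsets.latest=false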
>>>
>>> If you do need to start from specific offsets, please create a JIRA
>>> request, but I think it is hard for a user to know what offsets should
>>> be set for all the partitions.
>>>
>>> At 2019-06-13 22:34:59, "Andras Nagy" <andras.istvan.nagy@gmail.com>
>>> wrote:
>>>
>>> Dear Ma,
>>>
>>> Thank you very much!
>>>
>>> >1)yes, you can specify a configuration in the new cube, to consume
>>> data from start offset
>>> That is, an offset value for each partition of the topic? That would be
>>> good - could you please point me to where to do this in practice, or to
>>> what I should read? (I haven't found it in the cube designer UI -
>>> perhaps this is something that's only available via the API?)
>>>
>>> Many thanks,
>>> Andras
>>>
>>>
>>>
>>> On Thu, Jun 13, 2019 at 1:14 PM Ma Gang <mg4work@163.com> wrote:
>>>
>>>> Hi Andras,
>>>> 1) Yes, you can specify a configuration in the new cube to consume data
>>>> from the start offset.
>>>>
>>>> 2) It should work, but I haven't tested it yet.
>>>>
>>>> 3) As I remember, we currently use the Kafka 1.0 client library, so it
>>>> is better to use that version or later. I'm sure that versions before
>>>> 0.9.0 cannot work, but I'm not sure whether 0.9.x works or not.
>>>>
>>>>
>>>>
>>>> Ma Gang
>>>> Email: mg4work@163.com
>>>>
>>>> On 06/13/2019 18:01, Andras Nagy <andras.istvan.nagy@gmail.com> wrote:
>>>> Greetings,
>>>>
>>>> I have a few questions related to the new streaming (real-time OLAP)
>>>> implementation.
>>>>
>>>> 1) Is there a way to have data reprocessed from Kafka? E.g. I change a
>>>> cube definition and drop the cube (or add a new cube definition), and
>>>> want data that is still available in Kafka to be reprocessed to build
>>>> the changed cube (or the new cube). Is this possible?
>>>>
>>>> 2) Does the hybrid model work with streaming cubes (to combine two
>>>> cubes)?
>>>>
>>>> 3) What is the minimum Kafka version required? The tutorial asks to
>>>> install Kafka 1.0; is this the minimum required version?
>>>>
>>>> Thank you very much,
>>>> Andras
