From: Andras Nagy <andras.istvan.nagy@gmail.com>
Date: Tue, 25 Jun 2019 11:20:55 +0200
Subject: Re: Re: Kylin streaming questions
To: user@kylin.apache.org

Hi ShaoFeng,

Thanks a lot for the pointer on the lambda mode, yes, that's exactly what
I need :)

Is there perhaps documentation on this? For now, I was trying to get this
working 'empirically' and finally succeeded, but some of my conclusions
may be wrong. This is what I concluded:

- the Hive table must have the same name as the streaming table (the name
given to the data source)
- the cube can't be built from the UI (to build the historic segments from
the data in Hive), but it can be built using the REST API (see the sketch
below)
- the cube build engine must be MapReduce. With Spark as the build engine
I got the exception "Cannot adapt to interface
org.apache.kylin.engine.spark.ISparkOutput"
- endTime must be non-overlapping with the streaming data. When I had an
overlap, the streaming data coming from Kafka did not show up in the
output; I guess this is what you meant by "the segments from Hive will
overwrite the segments from Kafka".

Are these correct conclusions? Is there anything else I should be aware of?
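In case it's useful to others, this is roughly the call I ended up using
to trigger the historical build (a minimal sketch in Python; the host,
credentials, cube name and timestamps are placeholders, and the endpoint
is the generic cube build API from the Kylin REST docs, not anything
lambda-specific):

import base64
import json
import urllib.request

KYLIN_HOST = "http://localhost:7070"  # placeholder Kylin instance
CUBE_NAME = "my_lambda_cube"          # placeholder cube name

# Kylin's REST API uses HTTP Basic auth; ADMIN/KYLIN is the default account.
auth = base64.b64encode(b"ADMIN:KYLIN").decode("ascii")

# Build a historical segment from Hive for an explicit time range.
# Timestamps are epoch milliseconds; per the conclusions above, endTime
# should not overlap the range already covered by streaming segments.
payload = json.dumps({
    "startTime": 1559347200000,  # 2019-06-01 00:00:00 UTC
    "endTime": 1561334400000,    # 2019-06-24 00:00:00 UTC
    "buildType": "BUILD",
}).encode("utf-8")

req = urllib.request.Request(
    url=f"{KYLIN_HOST}/kylin/api/cubes/{CUBE_NAME}/build",
    data=payload,
    method="PUT",
    headers={
        "Authorization": f"Basic {auth}",
        "Content-Type": "application/json",
    },
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode("utf-8"))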
Many thanks,
Andras


On Tue, Jun 25, 2019 at 9:19 AM ShaoFeng Shi <shaofengshi@apache.org> wrote:

> Hello Andras,
>
> Kylin's realtime-OLAP feature supports a "Lambda" mode (mentioned in
> https://kylin.apache.org/blog/2019/04/12/rt-streaming-design/), which
> means you can define a fact table whose data can come from both Kafka and
> Hive. The only requirement is that all the cube columns appear in both
> the Kafka data and the Hive data. I think that may fit your need. The
> cube can be built from Kafka, and in the meanwhile it can also be built
> from Hive; the segments from Hive will overwrite the segments from Kafka
> (as usually the Hive data is more accurate). When querying the cube,
> Kylin will first query the historical segments, and then the real-time
> segments (adding the max time of the historical segments as a condition).
>
>
> Best regards,
>
> Shaofeng Shi 史少锋
> Apache Kylin PMC
> Email: shaofengshi@apache.org
>
> Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
> Join Kylin user mail group: user-subscribe@kylin.apache.org
> Join Kylin dev mail group: dev-subscribe@kylin.apache.org
>
>
> Andras Nagy <andras.istvan.nagy@gmail.com> wrote on Mon, Jun 24, 2019 at
> 11:29 PM:
>
>> Dear Ma,
>>
>> Thanks for your reply.
>>
>> Slightly related to my original question on the hybrid model, I was
>> wondering if it's possible to combine a batch and a streaming cube. I
>> realized this is not possible, as a hybrid model can only be created
>> from cubes of the same model (and a model points to either a batch or a
>> streaming data source).
>>
>> The use case would be this:
>> - we have a large amount of streaming data in Kafka that we would like
>> to process with Kylin streaming
>> - Kafka retention is only a few days, so if we need to change anything
>> in the cubes (e.g. introduce a new metric or dimension which has been
>> present in the events, but not in the cube definition), we can only
>> reprocess a few days' worth of data in the streaming model
>> - the raw events are also written to a data lake for long-term storage
>> - the data written to the data lake could be used to feed the historic
>> data into a batch Kylin model (and cubes)
>> - I'm looking for a way to combine these, so if we want to change
>> anything in the cubes, we can recalculate them for the historic data as
>> well
>>
>> Is there a way to achieve this with current Kylin? (Without implementing
>> a custom query layer that combines the two cubes.)
>>
>> Best regards,
>> Andras
>>
>>
>> On Fri, Jun 14, 2019 at 6:43 AM Ma Gang <mg4work@163.com> wrote:
>>
>>> Hi Andras,
>>>
>>> Currently it doesn't support consuming from specified offsets; it only
>>> supports consuming from the start offset or the latest offset. If you
>>> want to consume from the start offset, you need to set the
>>> configuration kylin.stream.consume.offsets.latest to false in the
>>> cube's overrides page.
>>>
>>> If you do need to start from specified offsets, please create a JIRA
>>> request, but I think it is hard for a user to know what offsets should
>>> be set for all partitions.
>>>
>>> At 2019-06-13 22:34:59, "Andras Nagy" <andras.istvan.nagy@gmail.com>
>>> wrote:
>>>
>>> Dear Ma,
>>>
>>> Thank you very much!
>>>
>>> >1) yes, you can specify a configuration in the new cube, to consume
>>> data from the start offset
>>> That is, an offset value for each partition of the topic? That would be
>>> good - could you please point me to where to do this in practice, or
>>> point me to what I should read? (I haven't found it on the cube
>>> designer UI - perhaps this is something that's only available on the
>>> API?)
>>>
>>> Many thanks,
>>> Andras
>>>
>>>
>>> On Thu, Jun 13, 2019 at 1:14 PM Ma Gang <mg4work@163.com> wrote:
>>>
>>>> Hi Andras,
>>>>
>>>> 1) yes, you can specify a configuration in the new cube, to consume
>>>> data from the start offset
>>>>
>>>> 2) it should work, but I haven't tested it yet
>>>>
>>>> 3) as I remember, we currently use the Kafka 1.0 client library, so it
>>>> is better to use that version or later; I'm sure that versions before
>>>> 0.9.0 cannot work, but I'm not sure whether 0.9.x works or not
>>>>
>>>>
>>>> Ma Gang
>>>> Email: mg4work@163.com
>>>>
>>>> (Signature customized by NetEase Mail Master)
>>>>
>>>> On 06/13/2019 18:01, Andras Nagy wrote:
>>>> Greetings,
>>>>
>>>> I have a few questions related to the new streaming (real-time OLAP)
>>>> implementation.
>>>>
>>>> 1) Is there a way to have data reprocessed from Kafka? E.g. I change a
>>>> cube definition and drop the cube (or add a new cube definition) and
>>>> want the data that is still available on Kafka to be reprocessed to
>>>> build the changed cube (or new cube)? Is this possible?
>>>>
>>>> 2) Does the hybrid model work with streaming cubes (to combine two
>>>> cubes)?
>>>>
>>>> 3) What is the minimum Kafka version required? The tutorial asks to
>>>> install Kafka 1.0 - is this the minimum required version?
>>>>
>>>> Thank you very much,
>>>> Andras
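PS: for anyone finding this thread in the archives - the override Ma
mentioned above goes into the cube's "Configuration Overrides" page in the
cube designer. In the saved cube descriptor JSON it ends up looking roughly
like this (a sketch; only the property name comes from Ma's mail, and the
surrounding field is how overrides appear to be stored in the descriptor):

"override_kylin_properties": {
    "kylin.stream.consume.offsets.latest": "false"
}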