From: ShaoFeng Shi
Date: Thu, 1 Nov 2018 15:08:12 +0800
Subject: Re: Re: [DISCUSS] New Kylin Streaming Solution
 From eBay
To: dev

Hi Gang,

Thank you for the information; it is helpful for understanding the overall
design and implementation. Do you have any statistics on performance,
throughput, stability, etc.? Also, what is the plan for contributing it to
the community?

Thanks!

Ma Gang wrote on Thu, Nov 1, 2018 at 2:45 PM:

> Thanks Xiaoxiang,
> Very good questions! Please see my comments starting with [Gang]:
>
>
> 1. Is it possible to use Yarn as the cluster manager for index tasks?
> The Coordinator process would set them up at a specified period.
> [Gang] I think it is possible, but in the current design the indexing
> task is a long-running task that also provides query service. This makes
> the whole system very simple and efficient, so I don't think we need to
> stop and start the indexing task from time to time. Using Yarn to manage
> the resources is possible, but we would need to redesign the existing
> coordinator to make it easy to deploy to Yarn, Kubernetes, etc. Hopefully
> this can be done after the contribution to the community.
>
> 2. As I know, eBay’s New Kylin Streaming Solution uses a replica set to
> ensure that incoming messages won’t be lost if some processes are lost.
> I think a replica set is a set of Kafka consumer processes responsible
> for ingesting messages and building the base cuboid in memory. Could you
> please show me some detail about how the replica set provides the HA
> guarantee? How is it configured? A link or paper is OK. I found one, but
> I don’t know if it has the same meaning as your replica set.
>
>
> [Gang] Yes, it is similar to MongoDB replication, but currently we don't
> replicate data from a primary node. We just assign the same Kafka
> topic/partitions to all receivers in a ReplicaSet, so every receiver in
> the ReplicaSet consumes the same data from Kafka. If one receiver is
> down, the other receivers in the ReplicaSet are still consuming the same
> Kafka data, so consumption and queries are not impacted. We don't
> guarantee that the receivers in a ReplicaSet have the same consuming
> rate, but we can guarantee that the user sees data consistently by
> sticking the queries for one cube to one receiver.
> The HA implementation is a little naive, but it is simple and it works.
> Maybe in the future we can do HA by replication, to support other
> streaming sources that don't support multiple consumers and don't have a
> persistent store.
>
> 3. How do you add or remove a node of a replica set in a production env?
> How do you monitor the health/pressure of the replica set cluster?
> [Gang] Currently we have a UI/RESTful API that lets an admin add a node
> to, or remove a node from, a ReplicaSet, and a simple UI that lets an
> admin monitor the health and consuming rate of each receiver/cube. All
> metrics are collected using the Yammer metrics framework, so they are
> easy to expose to other monitoring systems.
>
> 4. Are all measures supported in eBay’s New Kylin Streaming Solution?
> What about count distinct (bitmap)?
> [Gang] Most measures are supported, but precise count distinct (bitmap)
> is not supported when the distinct dimension is not of int type. As you
> know, supporting precise count distinct for a non-int dimension requires
> building a global dictionary, which is not possible in the streaming env.
>
>
> 5. It seems eBay’s New Kylin Streaming Solution uses a custom columnar
> storage. Why not use a mature open source columnar storage solution?
> Have you ever compared the performance of your custom columnar storage
> with an open source columnar storage solution?
>
> [Gang] Most open source columnar formats, like Parquet and ORC, are
> designed for use in a Hadoop env, while the streaming data is on local
> disk, so I didn't consider them at the beginning. It is not very hard to
> define a columnar format to store Kylin-specific data. With a customized
> columnar storage you can use mmap files to scan data and add a row-level
> inverted index for all dimensions, so I think the performance will be
> better than with a common columnar format. I didn't compare the
> performance, but the storage engine is pluggable, so you may contribute
> a Parquet storage if you are interested.
>
>
>
> At 2018-11-01 12:42:25, "Xiaoxiang Yu" wrote:
> >Hi Gang, I am so glad to know that eBay has a solution for real-time
> >OLAP on Kylin. I have some small questions:
> >
> >
> >1. Is it possible to use Yarn as the cluster manager for index tasks?
> >The Coordinator process would set them up at a specified period. Yarn
> >would manage:
> >
> >a) retrying these tasks if some fail
> >
> >b) resource allocation
> >
> >c) log collection
> >
> >2. As I know, eBay’s New Kylin Streaming Solution uses a replica set to
> >ensure that incoming messages won’t be lost if some processes are lost.
> >I think a replica set is a set of Kafka consumer processes responsible
> >for ingesting messages and building the base cuboid in memory. Could you
> >please show me some detail about how the replica set provides the HA
> >guarantee? How is it configured? A link or paper is OK. I found one, but
> >I don’t know if it has the same meaning as your replica set.
> >
> >a) [MongoDB replication](https://docs.mongodb.com/manual/replication/).
> >
> >3. How do you add or remove a node of a replica set in a production env?
> >How do you monitor the health/pressure of the replica set cluster?
> >
> >4. Are all measures supported in eBay’s New Kylin Streaming Solution?
> >What about count distinct (bitmap)?
> >
> >5.
> >   It seems eBay’s New Kylin Streaming Solution uses a custom columnar
> >storage. Why not use a mature open source columnar storage solution?
> >Have you ever compared the performance of your custom columnar storage
> >with an open source columnar storage solution?
> >
> >
> >
> >----------------
> >Best wishes,
> >Xiaoxiang Yu
> >
> >
> >From: Ma Gang
> >Reply-To: "dev@kylin.apache.org"
> >Date: Tuesday, October 30, 2018 15:24
> >To: "dev@kylin.apache.org"
> >Subject: [DISCUSS] New Kylin Streaming Solution From eBay
> >
> >Hi all,
> >
> >The eBay Kylin team has developed a new Kylin streaming solution. The
> >basic idea is to build a streaming cluster that ingests data from a
> >streaming source (Kafka) and provides queries on real-time data. The
> >data preparation latency is in milliseconds, which means the data is
> >queryable almost as soon as it is ingested. Attached is the architecture
> >design doc.
> >We would like to contribute the feature to the community; please let us
> >know if you have any concerns.
> >
> >Thanks,
> >Gang(Allen) Ma
> >

-- 
Best regards,

Shaofeng Shi 史少锋
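
The ReplicaSet behaviour Gang describes in his answer to question 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not eBay's implementation: the `Receiver` and `ReplicaSet` classes, the `assign` and `receiver_for` methods, and the CRC32-based choice of a preferred receiver are all hypothetical names and choices introduced here. The sketch shows two ideas from the thread: every receiver in a ReplicaSet is given the same Kafka topic/partitions (there is no primary-to-secondary replication), and queries for one cube stick to one receiver, moving to another only when that receiver is down.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """Hypothetical stand-in for one streaming receiver process."""
    name: str
    alive: bool = True
    # (topic, partition) pairs this receiver consumes from Kafka.
    assignments: list = field(default_factory=list)


@dataclass
class ReplicaSet:
    """Hypothetical model of a group of receivers sharing one assignment."""
    receivers: list

    def assign(self, topic, partitions):
        # No data is copied between receivers: each one independently
        # consumes the same topic/partitions, so any of them can serve
        # queries if another goes down.
        for receiver in self.receivers:
            receiver.assignments = [(topic, p) for p in partitions]

    def receiver_for(self, cube):
        # Receivers may consume at different rates, so queries for a cube
        # are pinned to one preferred receiver to keep results consistent.
        # crc32 is used (an assumption of this sketch) because it is stable
        # across processes, unlike Python's built-in hash() for strings.
        count = len(self.receivers)
        if count == 0:
            raise RuntimeError("replica set has no receivers")
        start = zlib.crc32(cube.encode("utf-8")) % count
        # Walk forward from the preferred receiver to the first live one.
        for offset in range(count):
            candidate = self.receivers[(start + offset) % count]
            if candidate.alive:
                return candidate
        raise RuntimeError("no live receiver in replica set")


if __name__ == "__main__":
    replica_set = ReplicaSet([Receiver("receiver-1"), Receiver("receiver-2")])
    replica_set.assign("events", [0, 1, 2])
    print(replica_set.receiver_for("sales_cube").name)
```

Choosing the preferred receiver from the full receiver list, rather than only from the live ones, keeps the routing of other cubes unchanged when a single receiver fails; whether the real coordinator behaves this way is not stated in the thread.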