kylin-dev mailing list archives

From Xiaoxiang Yu <xiaoxiang...@kyligence.io>
Subject Re: [DISCUSS] New Kylin Streaming Solution From eBay
Date Thu, 01 Nov 2018 07:12:31 GMT
Thank you for your reply. Maybe I can help improve your Kylin Streaming Solution in the
future.


----------------
Best wishes,
Xiaoxiang Yu





On [DATE], "[NAME]" <[ADDRESS]> wrote:



    Thanks Xiaoxiang,

    Very good questions! Please see my comments, marked with [Gang]:





    1. Is it possible to use Yarn as the cluster manager for the index tasks? The
    Coordinator process would start them at a specified interval.

    [Gang] I think it is possible. In the current design, however, the indexing task is a
    long-running task that also serves queries, which keeps the whole system simple and
    efficient, so I don't think we need to stop and start indexing tasks periodically. Using
    Yarn to manage resources is still possible, but we would need to redesign the existing
    coordinator to make it easy to deploy on Yarn, Kubernetes, etc. I hope this can be done
    after the contribution to the community.



    2. As far as I know, eBay's New Kylin Streaming Solution uses a replica set to ensure
    that incoming messages are not lost if some processes fail. I think a replica set is a
    set of Kafka consumer processes responsible for ingesting messages and building the base
    cuboid in memory. Could you please share some details about how the replica set provides
    the HA guarantee, and how to configure it? A link or paper is fine. I found one, but I
    don't know whether it means the same thing as your replica set.





    [Gang] Yes, it is similar to MongoDB replication, but currently we don't replicate data
    from a primary node. We just assign the same Kafka topic/partitions to all receivers in a
    ReplicaSet, so every receiver in the ReplicaSet consumes the same data from Kafka. If one
    receiver goes down, the other receivers in the ReplicaSet are still consuming the same
    Kafka data, so consumption and queries are not impacted. We don't guarantee that the
    receivers in a ReplicaSet consume at the same rate, but we can guarantee that users see a
    consistent view of the data by sticking the queries for one cube to one receiver.

    The HA implementation is a little naive, but it is simple and it works. In the future,
    maybe we can do HA by replication, to support other streaming sources that don't support
    multiple consumers and don't have a persistent store.
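
To make the ReplicaSet behaviour described above concrete, here is a minimal Python sketch.
It is not the eBay implementation: the names Receiver, ReplicaSet, assign and route_query
are hypothetical, and picking the receiver with a CRC32 hash of the cube name is only one
assumed way to get sticky routing. It shows the two ideas from the answer: every receiver
is given the same Kafka partitions (no data is copied between receivers), and queries for
one cube keep going to the same receiver while it is alive.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """Hypothetical streaming receiver that consumes Kafka partitions."""
    name: str
    alive: bool = True
    # partition id -> next offset to consume; each receiver tracks its own progress
    offsets: dict = field(default_factory=dict)


@dataclass
class ReplicaSet:
    """Hypothetical group of receivers that all consume the same partitions."""
    receivers: list
    partitions: list

    def assign(self):
        # Every receiver gets the full partition list; nothing is replicated
        # from a primary, each receiver reads Kafka independently.
        for receiver in self.receivers:
            for partition in self.partitions:
                receiver.offsets.setdefault(partition, 0)

    def route_query(self, cube):
        # Sticky routing (assumed scheme): a stable hash of the cube name picks
        # the preferred receiver; fall back to the next live one on failure.
        n = len(self.receivers)
        start = zlib.crc32(cube.encode("utf-8")) % n
        for i in range(n):
            candidate = self.receivers[(start + i) % n]
            if candidate.alive:
                return candidate
        raise RuntimeError("no live receiver in replica set")
```

Because receivers may consume at different rates, routing a cube's queries to one receiver
is what keeps successive results consistent; after a failover the new receiver may be
slightly ahead of or behind the old one.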



    3. How do you add or remove a node of a replica set in a production environment? How do
    you monitor the health/pressure of the replica set cluster?

    [Gang] Currently we have a UI and a RESTful API that let an admin add nodes to or remove
    nodes from a ReplicaSet, and a simple UI that lets an admin monitor the health and
    consuming rate of each receiver/cube. All metrics are also collected with the Yammer
    Metrics framework, so they are easy to expose to other monitoring systems.



    4. Are all measures supported in eBay's New Kylin Streaming Solution? What about count
    distinct (bitmap)?

    [Gang] Most measures are supported, but precise count distinct (bitmap) is not supported
    when the distinct dimension is not of int type. As you know, supporting precise count
    distinct for a non-int dimension requires building a global dictionary, which is not
    possible in the streaming environment.
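
The reason a global dictionary is needed can be illustrated with a small Python sketch. It
is a simplified illustration, not Kylin code: plain Python sets stand in for bitmaps, and
the function names are made up for this example. A bitmap stores integer IDs, so non-int
values must first be mapped to integers. If each segment builds its own dictionary, the
same ID means different values in different segments and the merged count is wrong; with
one shared dictionary the merge is exact.

```python
def encode_per_segment(segments):
    """Encode each segment with its own dictionary (IDs restart at 0)."""
    bitmaps = []
    for values in segments:
        local_dict = {}
        bitmap = set()  # stands in for a bitmap of integer IDs
        for value in values:
            bitmap.add(local_dict.setdefault(value, len(local_dict)))
        bitmaps.append(bitmap)
    return bitmaps


def encode_global(segments):
    """Encode all segments with one shared (global) dictionary."""
    global_dict = {}
    bitmaps = []
    for values in segments:
        bitmap = set()
        for value in values:
            bitmap.add(global_dict.setdefault(value, len(global_dict)))
        bitmaps.append(bitmap)
    return bitmaps


def merged_distinct_count(bitmaps):
    """Union the per-segment bitmaps and count the distinct IDs."""
    return len(set().union(*bitmaps))


# Three distinct users across two segments: alice, bob, carol.
segments = [["alice", "bob"], ["carol", "bob"]]
per_segment_count = merged_distinct_count(encode_per_segment(segments))
global_count = merged_distinct_count(encode_global(segments))
```

Here per_segment_count comes out as 2 (an undercount, because "alice" and "carol" both got
ID 0), while global_count is the correct 3. Int-typed values avoid the problem because
they can be used as bitmap IDs directly.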





    5. It seems eBay's New Kylin Streaming Solution uses a custom columnar storage. Why not
    use a mature open source columnar storage solution? Have you ever compared the
    performance of your custom columnar storage with an open source columnar storage solution?



    [Gang] Most open source columnar formats, like Parquet and ORC, are designed for use in
    a Hadoop environment, while the streaming data is on local disk, so I didn't consider
    them at the beginning. It is not very hard to define a columnar format for Kylin-specific
    data, and with a customized columnar storage you can use mmap'ed files to scan data and
    add row-level inverted indexes for all dimensions, so I think the performance will be
    better than with a common columnar format. I haven't compared the performance, but the
    storage engine is pluggable, so you may contribute a Parquet storage if you are
    interested.
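
As a rough illustration of the two techniques mentioned above (scanning a column through
an mmap'ed file, and a row-level inverted index on a dimension), here is a small Python
sketch. The file layout (one little-endian int32 per row), the function names, and the
sample data are all assumptions made for this example; the actual on-disk format of the
streaming storage is not described in this thread.

```python
import mmap
import os
import struct
import tempfile
from collections import defaultdict

INT32 = struct.Struct("<i")  # assumed layout: fixed-width int32 per row


def write_column(path, values):
    """Write one measure column as consecutive fixed-width integers."""
    with open(path, "wb") as f:
        for value in values:
            f.write(INT32.pack(value))


def build_inverted_index(dim_values):
    """Map each dimension value to the list of row ids that contain it."""
    index = defaultdict(list)
    for row_id, value in enumerate(dim_values):
        index[value].append(row_id)
    return index


def sum_where(measure_path, index, dim_value):
    """Sum the measure over rows matching dim_value, reading via mmap."""
    row_ids = index.get(dim_value, [])
    if not row_ids:
        return 0
    with open(measure_path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Fixed-width rows let us jump straight to each matching row.
            return sum(INT32.unpack_from(mm, r * INT32.size)[0] for r in row_ids)


def demo():
    countries = ["US", "DE", "US", "CN"]  # dimension column (sample data)
    amounts = [10, 20, 30, 40]            # measure column (sample data)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "amount.col")
        write_column(path, amounts)
        index = build_inverted_index(countries)
        return sum_where(path, index, "US")
```

Calling demo() sums only rows 0 and 2, so it returns 40 without reading the other rows.
The inverted index turns the filter into a list of row ids, and the mmap'ed column lets
the scan touch only those offsets.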













    At 2018-11-01 12:42:25, "Xiaoxiang Yu" <xiaoxiang.yu@kyligence.io> wrote:

    >Hi Gang, I am so glad to know that eBay has a solution for real-time OLAP on Kylin.
    >I have a few small questions:

    >

    >

    >1. Is it possible to use Yarn as the cluster manager for the index tasks? The
    >Coordinator process would start them at a specified interval. Yarn would manage:

    >

    >a)       retrying tasks that fail

    >

    >b)       resource allocation

    >

    >c)       log collection

    >

    >2. As far as I know, eBay's New Kylin Streaming Solution uses a replica set to ensure
    >that incoming messages are not lost if some processes fail. I think a replica set is a
    >set of Kafka consumer processes responsible for ingesting messages and building the base
    >cuboid in memory. Could you please share some details about how the replica set provides
    >the HA guarantee, and how to configure it? A link or paper is fine. I found one, but I
    >don't know whether it means the same thing as your replica set.

    >

    >a)       [Mongodb replication](https://docs.mongodb.com/manual/replication/).

    >

    >3. How do you add or remove a node of a replica set in a production environment? How do
    >you monitor the health/pressure of the replica set cluster?

    >

    >4. Are all measures supported in eBay's New Kylin Streaming Solution? What about count
    >distinct (bitmap)?

    >

    >5. It seems eBay's New Kylin Streaming Solution uses a custom columnar storage. Why not
    >use a mature open source columnar storage solution? Have you ever compared the
    >performance of your custom columnar storage with an open source columnar storage solution?

    >

    >

    >

    >----------------

    >Best wishes,

    >Xiaoxiang Yu

    >

    >

    >From: Ma Gang <mg4work@163.com>

    >Reply-To: "dev@kylin.apache.org" <dev@kylin.apache.org>

    >Date: Tuesday, October 30, 2018 15:24

    >To: "dev@kylin.apache.org" <dev@kylin.apache.org>

    >Subject: [DISCUSS] New Kylin Streaming Solution From eBay

    >

    >Hi all,

    >

    >The eBay Kylin team has developed a new Kylin streaming solution. The basic idea is to
    >build a streaming cluster that ingests data from a streaming source (Kafka) and serves
    >queries on real-time data. The data preparation latency is in milliseconds, which means
    >the data is queryable almost as soon as it is ingested. Attached is the architecture
    >design doc.

    >We would like to contribute the feature to the community. Please let us know if you
    >have any concerns.

    >

    >Thanks,

    >Gang(Allen) Ma
