kylin-dev mailing list archives

From: ShaoFeng Shi <shaofeng...@apache.org>
Subject: Re: Kylin Building Engine With SparkSql & Parquet
Date: Mon, 20 Jan 2020 08:00:53 GMT
Hi, Chunen,

Thanks for the information. What is the detailed plan for releasing this
feature to the community?

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Email: shaofengshi@apache.org

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org




Xiaoxiang Yu <xxyu@apache.org> wrote on Mon, Jan 20, 2020 at 1:59 PM:

> Great news!
> I can foresee Kylin becoming more cloud-native once the Parquet storage
> matures, and I hope the developer team will share more details about its
> design.
>
>
>
>
> --
>
> Best wishes to you!
> From: Xiaoxiang Yu
>
>
>
> At 2020-01-19 22:22:30, "George Ni" <nic@apache.org> wrote:
> >Hi Kylin users & developers,
> >
> >By-layer Spark cubing has been available in Apache Kylin since v2.0 to
> >achieve better performance, and it does run much faster than the MR
> >engine. HBase has also been Kylin's trusted storage engine since Kylin
> >was born, and it has proved successful at serving high-concurrency
> >queries over extremely large data sets with low latency. But HBase also
> >has limitations: filtering is not flexible, since we can only filter by
> >RowKey, and the measures are usually packed together in one cell, which
> >causes more data to be scanned than a query actually needs.
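> >
> >To make the last point concrete, here is a minimal sketch (hypothetical
> >Scala, not Kylin's actual storage code) of what packed measures imply:
> >all measures of a cuboid row sit in a single cell value, so a query that
> >only needs one measure still reads and decodes the others.
> >
> >    import java.nio.ByteBuffer
> >
> >    object PackedMeasuresSketch {
> >      // Hypothetical 3-measure layout: revenue, profit and quantity are
> >      // serialized into one cell value, as in an HBase-backed cuboid row.
> >      def pack(revenue: Long, profit: Long, quantity: Long): Array[Byte] =
> >        ByteBuffer.allocate(24)
> >          .putLong(revenue).putLong(profit).putLong(quantity).array()
> >
> >      // Answering SUM(revenue) still transfers and scans the bytes of
> >      // profit and quantity, even though the query never asked for them.
> >      def revenueOnly(cellValue: Array[Byte]): Long =
> >        ByteBuffer.wrap(cellValue).getLong()
> >
> >      def main(args: Array[String]): Unit = {
> >        val cell = pack(1000L, 200L, 7L)
> >        println(revenueOnly(cell))  // prints 1000
> >      }
> >    }
> >
> >With a columnar format each measure becomes its own column, so the
> >storage layer can return only the column the query touches.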
> >
> >
> >
> >So, in order to improve Kylin in both the building strategy and the
> >storage engine, the Kyligence development team is introducing a new cube
> >building engine that uses Spark SQL to construct cuboids with a new
> >strategy and stores the cube results in Parquet files. The building
> >strategy lets Kylin build cuboids in a smarter way by choosing and
> >building from the optimal cuboid source. Parquet, a columnar storage
> >format available to any project in the Hadoop ecosystem, will power the
> >filtering ability with its page-level column index and reduce I/O by
> >storing measures in separate columns. Also, by storing cuboids in
> >Parquet instead of HBase, we can use Kylin in a cloud-native way. More
> >information on the design and technical details will come soon.
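> >
> >As a rough illustration of the cubing idea (a sketch under assumed table
> >and column names, not the actual engine code), a cuboid is essentially a
> >GROUP BY over a chosen source, either the flat table or an already-built
> >parent cuboid, and its result is written out as Parquet:
> >
> >    import org.apache.spark.sql.{DataFrame, SparkSession}
> >    import org.apache.spark.sql.functions.{col, sum}
> >
> >    object CuboidBuildSketch {
> >      // Build one cuboid from the given source and persist it as Parquet.
> >      def buildCuboid(source: DataFrame, dims: Seq[String],
> >                      measure: String, outPath: String): DataFrame = {
> >        val cuboid = source
> >          .groupBy(dims.map(col): _*)
> >          .agg(sum(col(measure)).as(measure))
> >        cuboid.write.mode("overwrite").parquet(outPath)
> >        cuboid
> >      }
> >
> >      def main(args: Array[String]): Unit = {
> >        val spark = SparkSession.builder()
> >          .appName("cuboid-build-sketch").master("local[*]").getOrCreate()
> >        // Hypothetical SSB-style flat table path and columns.
> >        val flatTable = spark.read.parquet("/tmp/ssb_flat_table.parquet")
> >
> >        // Parent cuboid on (D_YEAR, C_NATION, P_CATEGORY) built from the
> >        // flat table ...
> >        val parent = buildCuboid(flatTable,
> >          Seq("D_YEAR", "C_NATION", "P_CATEGORY"), "LO_REVENUE",
> >          "/tmp/cuboids/year_nation_category")
> >
> >        // ... then a child cuboid aggregated from the smaller parent
> >        // instead of re-scanning the flat table (the "optimal cuboid
> >        // source" mentioned above).
> >        buildCuboid(parent, Seq("D_YEAR", "C_NATION"), "LO_REVENUE",
> >          "/tmp/cuboids/year_nation")
> >
> >        spark.stop()
> >      }
> >    }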
> >
> >
> >
> >Below is a comparison of build duration and result size between by-layer
> >Spark cubing and the new cubing strategy.
> >
> >
> >
> >Environment
> >
> >4-node Hadoop cluster;
> >YARN with 400 GB RAM and 128 cores in total;
> >CDH 5.1, Apache Kylin 3.0.
> >
> >Spark
> >
> >Spark 2.4.1-kylin-r17
> >
> >Test Data
> >
> >SSB data;
> >Cube: 15 dimensions, 3 measures (SUM).
> >
> >Test Scenarios
> >
> >Build the cube at different source sizes (30 million and 60 million
> >source rows) and compare the build time of Spark (by-layer) + HBase
> >against Spark SQL + Parquet.
> >
> >
> >Besides, we are attempting to resolve several drawbacks of the current
> >query engine, which relies heavily on Apache Calcite, such as the
> >performance bottleneck of aggregating large query results, which today
> >can only be done by a single worker. By embracing Spark SQL, this kind of
> >expensive computation can be carried out in a distributed manner.
> >Combined with the Parquet format, many filtering optimizations can also
> >be applied, which will boost Kylin's query performance significantly.
> >These features will be open-sourced, along with the technical details,
> >in the near future.
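> >
> >As a sketch of that point (assumed cuboid path and column names, not
> >Kylin's query engine code), the aggregation below runs as a distributed
> >Spark job over the Parquet cuboid, with the dimension filter pushed down
> >to the files, and only the small final result reaches the driver:
> >
> >    import org.apache.spark.sql.SparkSession
> >
> >    object DistributedAggSketch {
> >      def main(args: Array[String]): Unit = {
> >        val spark = SparkSession.builder()
> >          .appName("distributed-agg-sketch")
> >          .master("local[*]")
> >          .getOrCreate()
> >
> >        // Hypothetical cuboid produced by the build sketch above.
> >        spark.read.parquet("/tmp/cuboids/year_nation")
> >          .createOrReplaceTempView("CUBOID_YEAR_NATION")
> >
> >        // Partial aggregates are computed on the executors that hold the
> >        // Parquet splits; only the merged (year, nation) rows come back.
> >        val result = spark.sql(
> >          """SELECT D_YEAR, C_NATION, SUM(LO_REVENUE) AS REVENUE
> >            |FROM CUBOID_YEAR_NATION
> >            |WHERE D_YEAR BETWEEN 1993 AND 1997
> >            |GROUP BY D_YEAR, C_NATION""".stripMargin)
> >
> >        result.show()
> >        spark.stop()
> >      }
> >    }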
> >
> >
> >
> >   - https://issues.apache.org/jira/browse/KYLIN-4188
> >
> >
> >--
> >
> >---------------------
> >
> >Best regards,
> >
> >
> >
> >Ni Chunen / George
>
