kylin-dev mailing list archives

From "Xiaoxiang Yu" <x...@apache.org>
Subject Re:Kylin Building Engine With SparkSql & Parquet
Date Mon, 20 Jan 2020 05:58:09 GMT
Great news! 
I can foresee Kylin becoming more cloud-native once the Parquet storage matures,
and I hope the development team will share more details about its design.




--

Best wishes to you!
From: Xiaoxiang Yu



At 2020-01-19 22:22:30, "George Ni" <nic@apache.org> wrote:
>Hi Kylin users & developers,
>
>By-layer Spark cubing was introduced into Apache Kylin in v2.0 to
>achieve better performance, and it does run much faster than the MR
>engine. HBase has also been Kylin's trusted storage engine since Kylin
>was born, and it has proved successful in serving high-concurrency
>queries over extremely large data with low latency. But HBase has
>limitations as well: filtering is inflexible because we can only filter
>by RowKey, and measures are usually combined together, which causes more
>data to be scanned than requested.
>
>
>
>So, in order to optimize Kylin in both the building strategy and the
>storage engine, the development team at Kyligence is introducing a new
>cube building engine that uses Spark SQL to construct cuboids with a new
>strategy and stores cube results in Parquet files. The building strategy
>lets Kylin build cuboids in a smarter way by choosing and building from
>the optimal cuboid source. Parquet, a columnar storage format available
>to any project in the Hadoop ecosystem, will power the filtering ability
>with its page-level column index and reduce I/O by storing measures in
>separate columns. Also, by storing cuboids in Parquet instead of HBase,
>we can use Kylin in a cloud-native way. More information on the design
>and technical details will come soon.
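[As a rough illustration (not the actual Kylin implementation, whose details are yet to be published), "choosing the optimal cuboid source" can be read as: build each cuboid from the smallest already-built cuboid whose dimensions cover the target, rather than rescanning the base cuboid every time. A minimal sketch, with hypothetical names:

```python
def optimal_parent(target, built):
    """Pick the smallest already-built cuboid whose dimension set
    covers the target cuboid (hypothetical helper, for illustration).
    Fewer dimensions usually means fewer rows to aggregate from."""
    candidates = [c for c in built if set(target) <= set(c)]
    return min(candidates, key=len) if candidates else None

# The base cuboid holds all dimensions; children drop dimensions.
base = ("year", "region", "product")
built = [base, ("year", "region"), ("region", "product")]

# ("region",) can be built from a 2-dimension cuboid instead of
# re-aggregating the full 3-dimension base cuboid.
parent = optimal_parent(("region",), built)
print(parent)
```

The by-layer strategy always builds layer N+1 from layer N; a source-aware strategy like the sketch above can skip layers when a smaller covering cuboid already exists.]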
>
>
>
>Below is the comparison in building duration and size of results between
>By-layer Spark Cubing and the new cubing strategy.
>
>
>
>Environment
>
>4-node Hadoop cluster
>
>YARN has 400 GB RAM and 128 cores in total;
>
>CDH 5.1, Apache Kylin 3.0.
>
>
>
>Spark
>
>Spark 2.4.1-kylin-r17
>
>
>
>Test Data
>
>SSB data
>
>Cube: 15 dimensions, 3 measures (SUM)
>
>
>
>Test Scenarios
>
>Build the cube at different source sizes (30 million and 60 million
>source rows) and compare the build time of Spark (by layer) + HBase
>against Spark SQL + Parquet.
>
>
>Besides, we attempt to resolve several drawbacks of the current query
>engine, which relies heavily on Apache Calcite; for example, aggregating
>large query results is a performance bottleneck because it can currently
>be performed only by a single worker. By embracing Spark SQL, this kind
>of expensive computation can be done in a distributed manner. Combined
>with the Parquet format, plenty of filtering optimizations can be
>applied, which will boost Kylin's query performance significantly. These
>features will be open-sourced along with the technical details in the
>near future.
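[The single-worker bottleneck vs. distributed aggregation can be sketched in plain Python, independent of Spark: each partition is aggregated locally (the map side), and only the small partial results are merged (the reduce side). The partition data and helper names below are illustrative, not from Kylin:

```python
from collections import Counter
from functools import reduce

def partial_agg(partition):
    """Aggregate one partition locally: sum the measure per key.
    In Spark SQL this work runs in parallel on the executors."""
    totals = Counter()
    for key, value in partition:
        totals[key] += value
    return totals

def merge(a, b):
    """Combine two partial results; Counter.update adds counts.
    Only these small partials cross the network to be merged."""
    a.update(b)
    return a

# Two partitions of (dimension, measure) rows.
partitions = [
    [("CN", 10), ("US", 5), ("CN", 1)],
    [("US", 2), ("DE", 7)],
]
totals = reduce(merge, map(partial_agg, partitions))
print(dict(totals))  # {'CN': 11, 'US': 7, 'DE': 7}
```

A single-worker engine would instead pull every row to one process before aggregating; the partial-then-merge shape is what lets the expensive part scale out.]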
>
>
>
>   - https://issues.apache.org/jira/browse/KYLIN-4188
>
>
>-- 
>
>---------------------
>
>Best regards,
>
>
>
>Ni Chunen / George