kylin-user mailing list archives

From JiaTao Tao <>
Subject Re: Re: Evaluate Kylin on Parquet
Date Wed, 19 Dec 2018 08:45:22 GMT
Hi Gang

In my opinion, segment/partition pruning actually falls within the scope of
an "index system". We could have such a system at the storage level that
includes a file index (for segment/partition pruning), a page index (for page
pruning), and so on. Putting all of this in one place would make the
separation of duties cleaner.
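
A minimal sketch of what such a two-level index could look like, assuming simple per-file and per-page min/max statistics on one integer column. All class and function names here are hypothetical; none of them come from Kylin or Parquet:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PageStats:
    """Min/max of one column within a single page (hypothetical structure)."""
    min_value: int
    max_value: int


@dataclass
class FileStats:
    """Min/max of one column for a whole file, plus its per-page stats."""
    name: str
    min_value: int
    max_value: int
    pages: List[PageStats]


def may_contain(min_value: int, max_value: int, target: int) -> bool:
    """An equality filter can only match if target lies within [min, max]."""
    return min_value <= target <= max_value


def plan_scan(files: List[FileStats], target: int) -> List[Tuple[str, int]]:
    """Return (file name, page number) pairs that still need to be read."""
    to_read = []
    for f in files:
        # File index: skip the whole file (segment/partition pruning).
        if not may_contain(f.min_value, f.max_value, target):
            continue
        for page_no, page in enumerate(f.pages):
            # Page index: skip individual pages inside a surviving file.
            if may_contain(page.min_value, page.max_value, target):
                to_read.append((f.name, page_no))
    return to_read


files = [
    FileStats("seg_2018_11", 1, 100, [PageStats(1, 50), PageStats(51, 100)]),
    FileStats("seg_2018_12", 101, 200, [PageStats(101, 150), PageStats(151, 200)]),
]
print(plan_scan(files, 120))  # [('seg_2018_12', 0)]
```

Both levels answer the same question ("can this range contain the value?") at different granularity, which is why they fit naturally into one index component.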

On Wed, Dec 19, 2018 at 6:31 AM, Ma Gang <> wrote:

> Awesome! Looking forward to the improvement. As for the dictionary, keeping
> it in the query engine is usually not a good idea, since it puts a lot of
> pressure on the Kylin server. Sometimes it has benefits, though: for
> example, some segments can be pruned very early when the filter value is
> not in the dictionary, and some queries can be answered directly from the
> dictionary, as described in:
> At 2018-12-17 15:36:01, "ShaoFeng Shi" <> wrote:
> I think the dimension dictionary is a legacy design from HBase storage:
> because HBase has no data types and everything is a byte array, Kylin has
> to encode STRING and other types with some encoding method such as the
> dictionary.
> With storage like Parquet, the storage layer decides how to encode the
> data at the page or block level, so we can drop the dictionary after the
> cube is built. This will relieve the memory pressure on Kylin query nodes
> and also benefit the UHC (ultra-high-cardinality) case.
> Best regards,
> Shaofeng Shi 史少锋
> Apache Kylin PMC
> Work email:
> Kyligence Inc:
> Apache Kylin FAQ:
> Join Kylin user mail group:
> Join Kylin dev mail group:
> On Mon, Dec 17, 2018 at 1:23 PM, Chao Long <> wrote:
>>  In this PoC, we verified that Kylin on Parquet is viable, but the query
>> performance still has room to improve. We can improve it in the
>> following aspects:
>>  1. Minimize result set serialization time
>>  Since Kylin needs Object[] data to process, we convert the Dataset to an
>> RDD and then convert each "Row" to Object[], so Spark has to serialize the
>> Object[] before returning it to the driver. This time should be avoided.
>>  2. Query without dictionary
>>  In this PoC, to use less storage, we keep the dict-encoded value in the
>> Parquet file for dict-encoded dimensions, so Kylin must load the
>> dictionary to convert the dict values at query time. If we keep the
>> original value for dict-encoded dimensions, the dictionary is unnecessary.
>> We also don't have to worry about storage use, because Parquet will encode
>> the values itself. We should remove the dictionary from the query path.
>>  3. Remove the query single-point issue
>>  In this PoC, we use Spark to read and process cube data, which is
>> distributed, but Kylin also needs to process the result data returned by
>> Spark in a single JVM. We can try to make that distributed too.
>>  4. Upgrade Parquet to 1.11 for page index
>>  In this PoC, Parquet doesn't have a page index, so filter performance is
>> poor. We need to upgrade Parquet to version 1.11, which has a page index,
>> to improve filter performance.
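
To make item 2 concrete, here is a minimal sketch of the trade-off, assuming a toy in-memory dictionary that maps each distinct dimension value to an integer id. The `Dictionary` class and `segment_may_match` helper are hypothetical illustrations, not Kylin's actual dictionary implementation:

```python
from typing import Dict, List, Optional


class Dictionary:
    """Toy dimension dictionary: maps each distinct value to an integer id."""

    def __init__(self, values: List[str]) -> None:
        # Sorted distinct values; the list position is the id.
        self.id_to_value: List[str] = sorted(set(values))
        self.value_to_id: Dict[str, int] = {
            v: i for i, v in enumerate(self.id_to_value)
        }

    def encode(self, value: str) -> Optional[int]:
        """Return the id for a value, or None if the value is unknown."""
        return self.value_to_id.get(value)

    def decode(self, code: int) -> str:
        return self.id_to_value[code]


def segment_may_match(dictionary: Dictionary, filter_value: str) -> bool:
    """A segment can be skipped when the filter value is not in its dictionary."""
    return dictionary.encode(filter_value) is not None


raw_column = ["Beijing", "Shanghai", "Beijing", "Shenzhen"]
dictionary = Dictionary(raw_column)

# Storing ids (as in the PoC): compact, but the query side must load
# the dictionary to turn ids back into values.
encoded_column = [dictionary.encode(v) for v in raw_column]
decoded_column = [dictionary.decode(c) for c in encoded_column]

# Storing raw values instead: no dictionary is needed at query time,
# and compact encoding is left to the storage format.
stored_raw = list(raw_column)

print(encoded_column)                               # [0, 1, 0, 2]
print(decoded_column == stored_raw)                 # True
print(segment_may_match(dictionary, "Hangzhou"))    # False
```

The last line shows the one benefit that is lost when the dictionary is dropped from the query path: an early "value not present" check per segment, which would then have to come from file or page statistics instead.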
>> ------------------
>> Best Regards,
>> Chao Long
>> ------------------ Original Message ------------------
>> *From:* "ShaoFeng Shi"<>;
>> *Sent:* Friday, December 14, 2018, 4:39 PM
>> *To:* "dev"<>;"user"<>;
>> *Subject:* Evaluate Kylin on Parquet
>> Hello Kylin users,
>> The first version of the Kylin on Parquet [1] feature has been staged in
>> the Kylin code repository for public review and evaluation. You can check
>> out the "kylin-on-parquet" branch [2] to read the code, and you can also
>> make a binary build to run an example. When creating a cube, you can select
>> "Parquet" as the storage on the "Advanced setting" page. Both the MapReduce
>> and Spark engines support this new storage. A tech blog on the design and
>> implementation is being drafted.
>> Thanks so much to the engineers for their hard work: Chao Long and Yichen Zhou!
>> This is not the final version; there is room to improve in many aspects:
>> Parquet, Spark, and Kylin. It can be used for a PoC at this moment. Your
>> comments are welcome. Let's improve it together.
>> [1]
>> [2]
>> Best regards,
>> Shaofeng Shi 史少锋
>> Apache Kylin PMC



Aron Tao
