kylin-issues mailing list archives

From "Shaofeng SHI (JIRA)" <>
Subject [jira] [Commented] (KYLIN-3487) Create a new measure for precise count distinct
Date Sat, 11 Aug 2018 01:55:00 GMT


Shaofeng SHI commented on KYLIN-3487:

Hi Yanghong,

Did you try the "Segment Dictionary" in Kylin? It is meant for bitmap calculations that won't cross
partitions (days). It seems to be aimed at the same problem.


You can find the description in

> Create a new measure for precise count distinct
> -----------------------------------------------
>                 Key: KYLIN-3487
>                 URL:
>             Project: Kylin
>          Issue Type: Improvement
>            Reporter: Zhong Yanghong
>            Assignee: Zhong Yanghong
>            Priority: Major
> To compute the precise count distinct, we can use a bitmap and a global dictionary. However,
the global dictionary has a limitation. It maps values to ids of integer type,
which means the number of ids will be less than 2B. And it's like a Pixiu: it only
grows and never shrinks.
> At eBay, there's a requirement to calculate the precise count distinct of sessions. The
session cardinality is large and will keep growing over time. Using the global dictionary
will no longer be feasible once its cardinality exceeds the 2B upper bound. How can we deal with
> The good news is that a session never crosses days. Given this property, we don't need
to merge bitmaps across days. To calculate the precise session cardinality, we can assign each
day its own bitmap and directly sum the cardinalities computed from each bitmap. No bitmap
merge is needed.
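
A minimal sketch of the per-day approach described above, not taken from Kylin. The function name `per_day_distinct_count` and the use of a plain Python integer as the bitmap are assumptions made for illustration; the result is only correct under the stated condition that no session appears on more than one day.

```python
from collections import defaultdict


def per_day_distinct_count(events):
    """Count distinct sessions from (day, session) pairs.

    Assumes a session never crosses days, so per-day counts can be added.
    """
    dictionaries = defaultdict(dict)  # day -> {session value: small integer id}
    bitmaps = defaultdict(int)        # day -> bitmap held in a Python int

    for day, session in events:
        ids = dictionaries[day]
        # Assign the next free id the first time a session is seen on this day.
        local_id = ids.setdefault(session, len(ids))
        bitmaps[day] |= 1 << local_id

    # The per-day dictionaries are no longer needed once the bitmaps are built.
    return sum(bin(bitmap).count("1") for bitmap in bitmaps.values())


events = [
    ("2018-08-01", "a"),
    ("2018-08-01", "b"),
    ("2018-08-01", "a"),
    ("2018-08-02", "c"),
]
total = per_day_distinct_count(events)
```

Because ids restart from zero every day, no single dictionary ever has to hold more entries than one day's sessions.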
> To use a bitmap for cardinality calculation, we need to map each raw value to an integer
id, which is achieved by encoding the value with a dictionary. Previously, a global dictionary
was used so that bitmaps from multiple segments could be merged. In this case, however,
no bitmap merge is needed, so the global dictionary is not needed either.
> We also don't need to filter by or group by session, so there's no need to map from
value to id or from id to value after the related bitmap is constructed. Therefore, we don't
need to store dictionaries for session. The bitmap alone is enough.
> To deal with segment merge, since the bitmaps of different segments cannot be merged into one
bitmap, we use a map to store multiple bitmaps. In the map, the key is the segment name
and the value is the segment-level bitmap.
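
The map-based segment merge can be sketched as follows. This is an illustration only, not Kylin's implementation: `merge_segments` and `cardinality` are hypothetical names, the segment names are made up, and each bitmap is again held in a Python integer.

```python
def merge_segments(*segments):
    """Merge measures by concatenating their maps of segment name -> bitmap.

    The bitmaps themselves are never OR-ed together, because ids from
    different segment dictionaries refer to different sessions.
    """
    merged = {}
    for segment in segments:
        for name, bitmap in segment.items():
            if name in merged:
                raise ValueError("duplicate segment name: " + name)
            merged[name] = bitmap
    return merged


def cardinality(measure):
    """Sum the number of set bits over every segment-level bitmap."""
    return sum(bin(bitmap).count("1") for bitmap in measure.values())


seg_a = {"20180801": 0b111}  # three sessions with ids 0, 1, 2
seg_b = {"20180802": 0b011}  # two other sessions with ids 0, 1
merged = merge_segments(seg_a, seg_b)
count = cardinality(merged)
naive = bin(0b111 | 0b011).count("1")  # OR-ing would wrongly collapse ids
```

Here `count` is 5, while OR-ing the two bitmaps would report 3, which is why the bitmaps are kept side by side in the map.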

This message was sent by Atlassian JIRA
