beam-commits mailing list archives

From "Jingsong Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-2478) Distinct Aggregates
Date Fri, 23 Jun 2017 03:06:04 GMT

    [ https://issues.apache.org/jira/browse/BEAM-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060353#comment-16060353 ]

Jingsong Lee commented on BEAM-2478:
------------------------------------

Count(Distinct) is a very interesting function.
The operator has to keep every distinct value of the field it has seen in order to count them, and this state can become very large.
There are three solutions as far as I know:
1. Exact count that keeps every distinct value: I think we can use a stateful ParDo with a
ValueState (for the count) and a SetState (for the distinct values).
2. Approximation algorithm: a cardinality estimator (HyperLogLog), a Bloom filter, or a bitmap.
This can greatly reduce the amount of state data, but the result may be inexact. Apache Kylin uses this.
3. Hierarchical calculation:
select a, count(distinct b) from t group by a; -----> select a, count(1) from (select a, b
from t group by a, b) t2 group by a;
The first operator deduplicates on (a, b) (it can also do some local aggregation by a, which
reduces the shuffled data), and the second operator counts by a. This can effectively reduce
the state data and ease data skew. Apache Impala uses this.
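
To make solutions 1 and 3 concrete, here is a minimal sketch in plain Python rather than Beam. It is only an illustration under assumptions: the function names and the sample rows are hypothetical, the dictionaries stand in for per-key ValueState and SetState, and nothing here is the actual Beam SQL implementation. It shows that counting with a per-key set and counting after a dedup stage give the same result.

```python
from collections import defaultdict


def count_distinct_with_set_state(rows):
    """Solution 1 (sketch): per key, keep a set of seen values and a count."""
    seen = defaultdict(set)    # stands in for SetState (distinct values)
    counts = defaultdict(int)  # stands in for ValueState (running count)
    for a, b in rows:
        if b not in seen[a]:
            seen[a].add(b)
            counts[a] += 1
    return dict(counts)


def count_distinct_two_stage(rows):
    """Solution 3 (sketch): deduplicate on (a, b), then count rows per a."""
    # Stage 1: equivalent of "select a, b from t group by a, b"
    distinct_pairs = set(rows)
    # Stage 2: equivalent of "select a, count(1) from t2 group by a"
    counts = defaultdict(int)
    for a, _b in distinct_pairs:
        counts[a] += 1
    return dict(counts)


# Hypothetical input rows (a, b); duplicates are intentional.
rows = [("x", 1), ("x", 1), ("x", 2), ("y", 3), ("y", 3)]
exact = count_distinct_with_set_state(rows)
staged = count_distinct_two_stage(rows)
print(exact, staged)
```

In the two-stage form, the first stage is keyed by (a, b), so a single hot value of a is spread over many keys before the final count, which is where the skew relief would come from.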

> Distinct Aggregates
> -------------------
>
>                 Key: BEAM-2478
>                 URL: https://issues.apache.org/jira/browse/BEAM-2478
>             Project: Beam
>          Issue Type: New Feature
>          Components: dsl-sql
>            Reporter: Jingsong Lee
>            Assignee: Tarush Grover
>
> eg: COUNT(DISTINCT empno)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
