spark-issues mailing list archives

From "Yin Huai (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-6006) Optimize count distinct in case of high cardinality columns
Date Thu, 17 Dec 2015 06:43:46 GMT

    [ https://issues.apache.org/jira/browse/SPARK-6006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15061594#comment-15061594 ]

Yin Huai commented on SPARK-6006:
---------------------------------

SPARK-12077 fixed it.

> Optimize count distinct in case of high cardinality columns
> -----------------------------------------------------------
>
>                 Key: SPARK-6006
>                 URL: https://issues.apache.org/jira/browse/SPARK-6006
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.1.1, 1.2.1
>            Reporter: Yash Datta
>            Assignee: Davies Liu
>            Priority: Minor
>             Fix For: 1.6.0
>
>
> When a column has many distinct values, count distinct becomes slow because all
> partial results are hashed into a single map. This can be improved by creating
> buckets (partial maps) in an intermediate stage, so that the same key from the
> first-stage partial maps hashes to the same bucket. The total distinct count is
> then the sum of the bucket sizes.
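
Below is a minimal sketch of the bucketing idea described in the quoted report, written
against the RDD API. The function name bucketedCountDistinct, the Long element type, and
the numBuckets parameter are assumptions for illustration only; this is not the change
that landed via SPARK-12077, just an outline of the two-stage approach.

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    // Two-stage count distinct: hash-partition values into buckets so that equal
    // values land in the same bucket, deduplicate within each bucket, then sum
    // the per-bucket distinct counts.
    def bucketedCountDistinct(values: RDD[Long], numBuckets: Int): Long = {
      values
        .map(v => (v, ()))                             // key each value by itself
        .partitionBy(new HashPartitioner(numBuckets))  // same value -> same bucket
        .mapPartitions { iter =>                       // local dedup per bucket
          Iterator.single(iter.map(_._1).toSet.size.toLong)
        }
        .fold(0L)(_ + _)                               // sum the bucket sizes
    }

In this sketch numBuckets would typically be set to the cluster's parallelism; using too
few buckets recreates the single-map bottleneck the issue describes.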




