spark-user mailing list archives

From Reynold Xin <r...@databricks.com>
Subject Fwd: multiple count distinct in SQL/DataFrame?
Date Wed, 07 Oct 2015 20:43:35 GMT
Adding user list too.



---------- Forwarded message ----------
From: Reynold Xin <rxin@databricks.com>
Date: Tue, Oct 6, 2015 at 5:54 PM
Subject: Re: multiple count distinct in SQL/DataFrame?
To: "dev@spark.apache.org" <dev@spark.apache.org>


To provide more context, if we do remove this feature, the following SQL
query would throw an AnalysisException:

select count(distinct colA), count(distinct colB) from foo;

The following should still work:

select count(distinct colA) from foo;

The following should also work:

select count(distinct colA, colB) from foo;


On Tue, Oct 6, 2015 at 5:51 PM, Reynold Xin <rxin@databricks.com> wrote:

> The current implementation of multiple count distinct in a single query is
> markedly inferior in performance and robustness, and its correctness is
> hard to guarantee through some of the refactorings for Tungsten. A better
> version could be supported in the future, but it would take a lot of
> engineering effort. Most other Hadoop-based SQL systems (e.g. Hive, Impala)
> don't support this feature.
>
> As a result, we are considering removing support for multiple count
> distinct in a single query in the next Spark release (1.6). If you use this
> feature, please reply to this email. Thanks.
>
> Note that if you don't care about null values, it is relatively easy to
> reconstruct a query using joins to support multiple distincts.
>
>
>
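
As a rough illustration of the join-based rewrite mentioned in the quoted
message, here is a minimal sketch. It is not Spark code: it runs the SQL
against an in-memory SQLite database from Python's standard library so that
the direct and rewritten forms can be compared. The table contents and the
aliases a, b, cnt_a and cnt_b are made up for the example; only foo, colA
and colB come from the queries above.

```python
import sqlite3

# In-memory database standing in for the table "foo" (sample rows are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (colA INTEGER, colB TEXT)")
conn.executemany(
    "INSERT INTO foo VALUES (?, ?)",
    [(1, "x"), (1, "y"), (2, "y"), (3, None), (None, "z")],
)

# Direct form: two count distincts in a single query.
direct = conn.execute(
    "SELECT COUNT(DISTINCT colA), COUNT(DISTINCT colB) FROM foo"
).fetchone()

# Rewritten form: one single-distinct subquery per column, combined with a join.
# Each subquery returns exactly one row, so a cross join yields one combined row.
rewritten = conn.execute(
    """
    SELECT a.cnt_a, b.cnt_b
    FROM (SELECT COUNT(DISTINCT colA) AS cnt_a FROM foo) a
    CROSS JOIN (SELECT COUNT(DISTINCT colB) AS cnt_b FROM foo) b
    """
).fetchone()

print(direct, rewritten)
```

In this ungrouped case the two forms agree, because COUNT(DISTINCT ...)
skips NULLs in both. One place where NULLs can matter is a grouped variant:
if each subquery has a GROUP BY on some key and the results are joined on
that key with an equality condition, rows whose key is NULL do not match
and drop out of the result.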
