Adding user list too.



---------- Forwarded message ----------
From: Reynold Xin <rxin@databricks.com>
Date: Tue, Oct 6, 2015 at 5:54 PM
Subject: Re: multiple count distinct in SQL/DataFrame?
To: "dev@spark.apache.org" <dev@spark.apache.org>


To provide more context, if we do remove this feature, the following SQL query would throw an AnalysisException:

select count(distinct colA), count(distinct colB) from foo;

The following should still work:

select count(distinct colA) from foo;

The following should also work:

select count(distinct colA, colB) from foo;


On Tue, Oct 6, 2015 at 5:51 PM, Reynold Xin <rxin@databricks.com> wrote:
The current implementation of multiple count distinct in a single query performs poorly and is not robust, and it is hard to guarantee its correctness through some of the refactorings for Tungsten. Supporting a better version of it is possible in the future, but it will take a lot of engineering effort. Most other Hadoop-based SQL systems (e.g. Hive, Impala) don't support this feature.

As a result, we are considering removing support for multiple count distinct in a single query in the next Spark release (1.6). If you use this feature, please reply to this email. Thanks.

Note that if you don't care about null values, it is relatively easy to rewrite such a query using joins to support multiple distincts.
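
For illustration, here is a rough sketch of what such a join-based rewrite might look like. It is not taken from the Spark code base: it uses Python's built-in sqlite3 module as a stand-in engine so the SQL can be run anywhere, and the grouping column k and the sample rows are made up for the example. Each subquery contains only one count(distinct ...), and the results are joined back together. The grouped variant shows one plausible reading of the null caveat: a group whose key is NULL is lost in the join, because NULL = NULL is not true.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: k is an assumed grouping column, not part of the original thread.
conn.execute("CREATE TABLE foo (k TEXT, colA INTEGER, colB INTEGER)")
conn.executemany(
    "INSERT INTO foo VALUES (?, ?, ?)",
    [("x", 1, 10), ("x", 1, 20), ("x", 2, 20),
     ("y", 3, 30), ("y", 3, 30), (None, 4, 30)],
)

# No GROUP BY: one single-distinct subquery per column, combined with a cross join.
global_rewritten = conn.execute("""
    SELECT a.cnt_a, b.cnt_b
    FROM (SELECT count(DISTINCT colA) AS cnt_a FROM foo) a
    CROSS JOIN (SELECT count(DISTINCT colB) AS cnt_b FROM foo) b
""").fetchone()

# With GROUP BY: the form that uses multiple count distinct in one query.
direct = conn.execute("""
    SELECT k, count(DISTINCT colA), count(DISTINCT colB)
    FROM foo GROUP BY k ORDER BY k
""").fetchall()

# With GROUP BY: join the per-column aggregates on the grouping key.
# Rows whose key is NULL never satisfy a.k = b.k, so that group is dropped.
rewritten = conn.execute("""
    SELECT a.k, a.cnt_a, b.cnt_b
    FROM (SELECT k, count(DISTINCT colA) AS cnt_a FROM foo GROUP BY k) a
    JOIN (SELECT k, count(DISTINCT colB) AS cnt_b FROM foo GROUP BY k) b
      ON a.k = b.k
    ORDER BY a.k
""").fetchall()

print(global_rewritten)
print(direct)
print(rewritten)
```

On this sample data the joined version matches the direct query for the non-null groups and omits the NULL group. If NULL keys matter, the join condition would need a null-safe comparison, which is engine-specific.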