A simple and parallel way to do this is by breaking the data into ranges or hashes then do distinct counting on those. Hive should do something like this automatically.

This is a rather naive way.

SELECT column from source_table_0 where row_key mod 10 = 0;
SELECT column from source_table_1 where row_key mod 10 = 1;

create table all as
select count(dstinct) from source_table_0
union all
select count(distinct) from source_table_1

select count(*) from all;




On Wed, Aug 6, 2014 at 10:23 AM, Sergey Murylev <sergeymurylev@gmail.com> wrote:

Why do you think that default implementation of COUNT DISTINCT is slow? As far as I understand the most famous way to find number of distinct elements is to sort them and scan all sorted items consequently excluding duplicated elements. Assimptotics of this algoritm is O(n *log n ), I think that there is no way to do this faster in general case. I think that Hive should use map-reduce sort stage to make items sorted, but probably in your case we have only one reduce task because we need to aggregate result on single instance.
06 авг. 2014 г. 12:54 пользователь "Natarajan, Prabakaran 1. (NSN - IN/Bangalore)" <prabakaran.1.natarajan@nsn.com> написал:
>
> Hi
>  
> I am looking for high performance count distinct solution on Hive Query.
>  
> Regular count distinct is very slow but if I use probabilistic count distinct has more error percentage (if the number of records are small).
>  
>  
> Is there is any solution to have exact count distinct but using low memory and without error?
>  
> Thanks and Regards
> Prabakaran.N   
>  
>  
>