hadoop-user mailing list archives

From Sergey Murylev <sergeymury...@gmail.com>
Subject Re: High performance Count Distinct - NO Error
Date Wed, 06 Aug 2014 14:23:54 GMT
Why do you think the default implementation of COUNT DISTINCT is slow? As
far as I understand, the best-known way to count distinct elements is to
sort them and then scan the sorted items sequentially, skipping duplicates.
The asymptotic complexity of this algorithm is O(n log n), and I think
there is no way to do this faster in the general case. I would expect Hive
to use the map-reduce sort stage to sort the items, but in your case there
is probably only one reduce task, because the result has to be aggregated
on a single instance.
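
To make the sort-and-scan idea concrete, here is a minimal single-machine sketch in Python. It only illustrates the algorithm described above and is not how Hive implements COUNT DISTINCT; the function name count_distinct_sorted is made up for this example.

```python
def count_distinct_sorted(items):
    """Count distinct values by sorting, then scanning neighbours once."""
    ordered = sorted(items)  # O(n log n) comparison sort
    if not ordered:
        return 0

    distinct = 1  # the first element is always a new value
    # After sorting, equal values are adjacent, so a value is new
    # exactly when it differs from the element just before it.
    for previous, current in zip(ordered, ordered[1:]):
        if current != previous:
            distinct += 1
    return distinct


if __name__ == "__main__":
    print(count_distinct_sorted(["a", "c", "a", "b", "c"]))
```

The scan itself is linear, so the sort dominates the cost. In a map-reduce setting the shuffle would do the sorting, and the final scan is the part that ends up on a single reducer.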
On 6 Aug 2014 at 12:54, "Natarajan, Prabakaran 1. (NSN -
IN/Bangalore)" <prabakaran.1.natarajan@nsn.com> wrote:
>
> Hi
>
> I am looking for a high-performance count distinct solution for Hive queries.
>
> Regular count distinct is very slow, but probabilistic count distinct
> has a higher error percentage (if the number of records is small).
>
>
> Is there any solution that gives an exact count distinct, with low
> memory use and without error?
>
> Thanks and Regards
> Prabakaran.N
>
>
>
