Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of sergeymurylev@gmail.com
 designates 209.85.213.48 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <DD08C21E8C680641B67C6C279273A5350F0BFA19@SGSIMBX001.nsn-intra.net>
References: 
 <DD08C21E8C680641B67C6C279273A5350F0BFA19@SGSIMBX001.nsn-intra.net>
Date: Wed, 6 Aug 2014 18:23:54 +0400
Message-ID: 
 <CACYD1LL-zwq1TEmxN+_-pkdnwufQXkUhvD8MVoHiW40tzZ5png@mail.gmail.com>
Subject: Re: High performance Count Distinct - NO Error
From: Sergey Murylev <sergeymurylev@gmail.com>
To: user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=20cf30050c58d11e0904fff6b856

--20cf30050c58d11e0904fff6b856
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Why do you think that the default implementation of COUNT DISTINCT is slow?
As far as I understand, the best-known way to find the number of distinct
elements is to sort them and then scan the sorted items consecutively,
skipping duplicates. The asymptotic complexity of this algorithm is
O(n log n), and I think there is no way to do this faster in the general
case. I would expect Hive to use the map-reduce sort stage to sort the
items, but in your case there is probably only one reduce task, because the
result has to be aggregated on a single instance.
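
To make the sort-and-scan idea concrete, here is a minimal Python sketch. It
only illustrates the algorithm described above; it is not how Hive
implements COUNT DISTINCT, and the function name count_distinct_sorted is
made up for this example.

```python
def count_distinct_sorted(items):
    """Count distinct values by sorting, then comparing neighbours."""
    ordered = sorted(items)  # the O(n log n) step
    if not ordered:
        return 0
    distinct = 1  # the first element is always a new value
    # After sorting, duplicates are adjacent, so one linear pass is enough:
    # a value is new exactly when it differs from its predecessor.
    for prev, cur in zip(ordered, ordered[1:]):
        if cur != prev:
            distinct += 1
    return distinct
```

For example, count_distinct_sorted([3, 1, 2, 3, 1]) returns 3. The scan is
linear, so the sort dominates the total cost.
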
On 6 Aug 2014 at 12:54, "Natarajan, Prabakaran 1. (NSN - IN/Bangalore)"
<prabakaran.1.natarajan@nsn.com> wrote:
>
> Hi
>
> I am looking for a high-performance count distinct solution for Hive
> queries.
>
> Regular count distinct is very slow, but probabilistic count distinct has
> a higher error percentage (when the number of records is small).
>
>
> Is there any solution that gives an exact count distinct (no error) while
> using little memory?
>
> Thanks and Regards
> Prabakaran.N
>
>
>

--20cf30050c58d11e0904fff6b856--