Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (nike.apache.org: domain of edlinuxguru@gmail.com
 designates 74.125.82.46 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CACYD1LL-zwq1TEmxN+_-pkdnwufQXkUhvD8MVoHiW40tzZ5png@mail.gmail.com>
References: 
 <DD08C21E8C680641B67C6C279273A5350F0BFA19@SGSIMBX001.nsn-intra.net>
	<CACYD1LL-zwq1TEmxN+_-pkdnwufQXkUhvD8MVoHiW40tzZ5png@mail.gmail.com>
Date: Wed, 6 Aug 2014 17:51:40 -0400
Message-ID: 
 <CAENxBwxjSZR8J0UJ23R=uB-_S0BDJb0Kh8nvXCFjjN7++R3B1A@mail.gmail.com>
Subject: Re: High performance Count Distinct - NO Error
From: Edward Capriolo <edlinuxguru@gmail.com>
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Content-Type: multipart/alternative; boundary=001a11c350562510fd04fffcfad6

--001a11c350562510fd04fffcfad6
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

A simple, parallel way to do this is to break the data into ranges or hash
buckets and then do the distinct counting on each bucket. Hive should do
something like this automatically.

This is a rather naive way.

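-- Split on a hash of the counted column so equal values share a bucket.
-- (source_table and col are placeholder names.)
create table source_table_0 as
select col from source_table where pmod(hash(col), 10) = 0;
create table source_table_1 as
select col from source_table where pmod(hash(col), 10) = 1;
-- ... and so on for buckets 2 through 9

create table all_counts as
select u.cnt from (
  select count(distinct col) as cnt from source_table_0
  union all
  select count(distinct col) as cnt from source_table_1
) u;

select sum(cnt) from all_counts;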
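
One caveat with this kind of split: the per-bucket counts only add up to the
exact total if rows are bucketed on the counted value itself, so that
duplicates always land in the same bucket. Below is a minimal Python sketch of
that idea, not Hive code. The names bucket_of and partitioned_distinct_count
are made up for illustration, and zlib.crc32 simply stands in for whatever
hash function the real job would use.

```python
import zlib


def bucket_of(value: str, num_buckets: int) -> int:
    """Pick a bucket by hashing the value itself (not an unrelated row key)."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def partitioned_distinct_count(values, num_buckets=10):
    """Exact distinct count: dedupe inside each bucket, then sum bucket sizes."""
    buckets = [set() for _ in range(num_buckets)]
    for v in values:
        buckets[bucket_of(v, num_buckets)].add(v)
    # Equal values always hash to the same bucket, so nothing is counted twice.
    return sum(len(b) for b in buckets)


if __name__ == "__main__":
    data = ["a", "b", "a", "c", "b", "d", "a"]
    print(partitioned_distinct_count(data))  # same as len(set(data))
```

In a real job each bucket would presumably go to its own reducer, so no single
task has to hold every distinct value in memory.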


On Wed, Aug 6, 2014 at 10:23 AM, Sergey Murylev <sergeymurylev@gmail.com>
wrote:

> Why do you think the default implementation of COUNT DISTINCT is slow? As
> far as I understand, the best-known way to count distinct elements is to
> sort them and then scan the sorted items sequentially, skipping duplicates.
> The asymptotic complexity of this algorithm is O(n log n), and I think
> there is no way to do this faster in the general case. I think Hive should
> use the map-reduce sort stage to sort the items, but in your case there is
> probably only one reduce task, because the result has to be aggregated on a
> single instance.
> On 6 Aug 2014 at 12:54, "Natarajan, Prabakaran 1. (NSN - IN/Bangalore)"
> <prabakaran.1.natarajan@nsn.com> wrote:
> >
> > Hi
> >
> > I am looking for a high-performance count distinct solution for Hive
> > queries.
> >
> > Regular count distinct is very slow, but probabilistic count distinct
> > has a higher error percentage (when the number of records is small).
> >
> >
> > Is there any solution that gives an exact count distinct using little
> > memory and without error?
> >
> > Thanks and Regards
> > Prabakaran.N
> >
> >
> >
>


--001a11c350562510fd04fffcfad6--