Return-Path: X-Original-To: apmail-hadoop-hdfs-user-archive@minotaur.apache.org Delivered-To: apmail-hadoop-hdfs-user-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 9E857116E1 for ; Wed, 6 Aug 2014 21:52:11 +0000 (UTC) Received: (qmail 30738 invoked by uid 500); 6 Aug 2014 21:52:07 -0000 Delivered-To: apmail-hadoop-hdfs-user-archive@hadoop.apache.org Received: (qmail 30595 invoked by uid 500); 6 Aug 2014 21:52:07 -0000 Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hadoop.apache.org Delivered-To: mailing list user@hadoop.apache.org Received: (qmail 30584 invoked by uid 99); 6 Aug 2014 21:52:07 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 06 Aug 2014 21:52:07 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of edlinuxguru@gmail.com designates 74.125.82.46 as permitted sender) Received: from [74.125.82.46] (HELO mail-wg0-f46.google.com) (74.125.82.46) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 06 Aug 2014 21:52:04 +0000 Received: by mail-wg0-f46.google.com with SMTP id m15so3234586wgh.29 for ; Wed, 06 Aug 2014 14:51:40 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=93RKBRh43yPvTngbfZ6Zk0W0NwPeqMXcecsGhXWQiG8=; b=NTZgqQZ+gFO6sjG9naUcAKOGTEDhrs7KWF0tR0Q7OQyYqfV+RjQsQXZKOhwoIX3ZQa iERiOkc3WYRO5jVIN73tXx26GPUjhg1smRxvHjPHJXMSA4ENqigtjeDNKgA0Lr3D4mh5 wYVcjJ/Vnkep7BY/g+zODBMKPIwWxQ+8zuN2JqAG5LJ6YC+rkpJtuIkfMMYu32iI2y3Z cmtZ02G4entInv27y//2DX7TdlI9gtgeqlRJUc8nFC57WiwiUL4s+Ah1bO7h/FhTru6o kdIM04BCo94gWHu9JlZzNdhb0IjPTtXwzZpLzsoMcJEITYRoXeK7r+NzvfMxVVEsMDUq 1DYA== MIME-Version: 1.0 X-Received: by 10.180.185.228 with SMTP id ff4mr19695366wic.76.1407361900415; Wed, 06 Aug 2014 14:51:40 -0700 (PDT) Received: by 10.194.88.100 with HTTP; Wed, 6 Aug 2014 14:51:40 -0700 (PDT) In-Reply-To: References: Date: Wed, 6 Aug 2014 17:51:40 -0400 Message-ID: Subject: Re: High performance Count Distinct - NO Error From: Edward Capriolo To: "user@hadoop.apache.org" Content-Type: multipart/alternative; boundary=001a11c350562510fd04fffcfad6 X-Virus-Checked: Checked by ClamAV on apache.org --001a11c350562510fd04fffcfad6 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable A simple and parallel way to do this is by breaking the data into ranges or hashes then do distinct counting on those. Hive should do something like this automatically. This is a rather naive way. SELECT column from source_table_0 where row_key mod 10 =3D 0; SELECT column from source_table_1 where row_key mod 10 =3D 1; create table all as select count(dstinct) from source_table_0 union all select count(distinct) from source_table_1 select count(*) from all; On Wed, Aug 6, 2014 at 10:23 AM, Sergey Murylev wrote: > Why do you think that default implementation of COUNT DISTINCT is slow? A= s > far as I understand the most famous way to find number of distinct elemen= ts > is to sort them and scan all sorted items consequently excluding duplicat= ed > elements. Assimptotics of this algoritm is O(n *log n ), I think that the= re > is no way to do this faster in general case. I think that Hive should use > map-reduce sort stage to make items sorted, but probably in your case we > have only one reduce task because we need to aggregate result on single > instance. > 06 =D0=B0=D0=B2=D0=B3. 2014 =D0=B3. 12:54 =D0=BF=D0=BE=D0=BB=D1=8C=D0=B7= =D0=BE=D0=B2=D0=B0=D1=82=D0=B5=D0=BB=D1=8C "Natarajan, Prabakaran 1. (NSN - > IN/Bangalore)" =D0=BD=D0=B0=D0=BF=D0=B8= =D1=81=D0=B0=D0=BB: > > > > Hi > > > > I am looking for high performance count distinct solution on Hive Query= . > > > > Regular count distinct is very slow but if I use probabilistic count > distinct has more error percentage (if the number of records are small). > > > > > > Is there is any solution to have exact count distinct but using low > memory and without error? > > > > Thanks and Regards > > Prabakaran.N > > > > > > > --001a11c350562510fd04fffcfad6 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
A simple and parallel way to do t= his is by breaking the data into ranges or hashes then do distinct counting= on those. Hive should do something like this automatically.

This is a rather naive way.

SELECT column from sour= ce_table_0 where row_key mod 10 =3D 0;
SELECT column from source_table_1= where row_key mod 10 =3D 1;

create table all as
select count(dstinct) from source_table_0
union all
select coun= t(distinct) from source_table_1

select count(*) from all;
<= div>



On Wed, Aug 6, 2014 at 10:23 AM, Sergey = Murylev <sergeymurylev@gmail.com> wrote:

Why do you think that default implementation of COUNT DISTIN= CT is slow? As far as I understand the most famous way to find number of di= stinct elements is to sort them and scan all sorted items consequently excl= uding duplicated elements. Assimptotics of this algoritm is O(n *log n ), I= think that there is no way to do this faster in general case. I think that= Hive should use map-reduce sort stage to make items sorted, but probably i= n your case we have only one reduce task because we need to aggregate resul= t on single instance.
06 =D0=B0=D0=B2=D0=B3. 2014 =D0=B3. 12:54 =D0=BF=D0=BE=D0=BB=D1=8C=D0=B7=D0= =BE=D0=B2=D0=B0=D1=82=D0=B5=D0=BB=D1=8C "Natarajan, Prabakaran 1. (NSN= - IN/Bangalore)" <prabakaran.1.natarajan@nsn.com> =D0=BD=D0=B0=D0= =BF=D0=B8=D1=81=D0=B0=D0=BB:
>
> Hi
> =C2=A0
> I am looking for high performance count distinct solution on Hive Quer= y.
> =C2=A0
> Regular count distinct is very slow but if I use probabilistic count d= istinct has more error percentage (if the number of records are small).
> =C2=A0
> =C2=A0
> Is there is any solution to have exact count distinct but using low me= mory and without error?
> =C2=A0
> Thanks and Regards
> Prabakaran.N=C2=A0 =C2=A0
> =C2=A0
> =C2=A0
> =C2=A0


--001a11c350562510fd04fffcfad6--