Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 4ACBB200D08 for ; Wed, 23 Aug 2017 07:10:39 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 495301679F1; Wed, 23 Aug 2017 05:10:39 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 67AF516766D for ; Wed, 23 Aug 2017 07:10:38 +0200 (CEST) Received: (qmail 208 invoked by uid 500); 23 Aug 2017 05:10:31 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 198 invoked by uid 99); 23 Aug 2017 05:10:31 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 23 Aug 2017 05:10:31 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 47425C0169 for ; Wed, 23 Aug 2017 05:10:31 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -0.401 X-Spam-Level: X-Spam-Status: No, score=-0.401 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=2, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H2=-2.8, RCVD_IN_SORBS_SPAM=0.5, SPF_PASS=-0.001] autolearn=disabled Authentication-Results: spamd4-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id U2AJW1CJncEO for ; Wed, 23 Aug 2017 05:10:30 +0000 (UTC) Received: from mail-wr0-f182.google.com (mail-wr0-f182.google.com [209.85.128.182]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTPS id 793155F5B3 for ; Wed, 23 Aug 2017 05:10:29 +0000 (UTC) Received: by mail-wr0-f182.google.com with SMTP id p14so1893011wrg.1 for ; Tue, 22 Aug 2017 22:10:29 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:in-reply-to:references:from:date:message-id:subject:to; bh=gT3JYLnJsoodhi7wwa3zSIG86JZ/q5lmTPYprjUN+g8=; b=nRODzXlJBL+/BOd/8RzhjTaiKNUBPHU2iaRLOFkY0lubSMyKNq5sdbVgzZkuNJM+DT KpsLNLLYrfr/swapvkkYYC/4bB+62qM1hxSSYbRyTJAO23x2nXQA0AiUZcpe+DaiSBuU +VCnnHP2KACrB1NWGF/1RM1+/8M+C1F4mIN8C9iljOSh+0ppoLD7K+fpQUznXZATG4AB 6AsDqgTMyBmH5hg+4i7GkpGAGz9l61Rht7DWB3e9sYstv6RlTyVFZ5ZC6P3KVE5Ysmw6 XZz3bIJKfTv2lprIdxNG/RmAbLkVxMyHmW5br8t9yo95lsFw3qq+EzDZR9sH5J7mc8bU BKmA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to; bh=gT3JYLnJsoodhi7wwa3zSIG86JZ/q5lmTPYprjUN+g8=; b=U9lhM5Ht7E3D/YB/zJg/VBkY6nd5s8oqSD23B8Q+l5PeGzNISDVmBmKLwGZq62heTO C9A5tbo6BQyomvAX0ucIHmiMDdWbYatie0y3MyVEaGg88LWjK7kl2+l3qsW7w4FBuJdP 0fNpJtZj2obWlXPiW5h92F5bbCT9XfyxtFBaip179I4p/lBAlVpWFbvP23L26vvuCizA UdIYd4MUelT88CE1tCSd0JVgMiVuy+KVEC42E/TlkRqOAx8LNBREmlHNh1EDZ4up8eKr dkLs1rVS2N33/CSvfu5RuW1KFaoafPDDhZ5QG6dTJmZ+BUrQLnCD4Xf7mH3eavrMrud3 SScw== X-Gm-Message-State: AHYfb5hYCczF2Lsh+GqB0Xvqb2fw7Mv23DCvRDVXMBDIEOcra8jgyT0w zztF8sIxE8dn9tsaCR6uis+fQfVyKQQ0 X-Received: by 10.223.179.211 with SMTP id x19mr710610wrd.7.1503465028957; Tue, 22 Aug 2017 22:10:28 -0700 (PDT) MIME-Version: 1.0 Received: by 10.223.148.231 with HTTP; Tue, 22 Aug 2017 22:10:28 -0700 (PDT) In-Reply-To: References: From: panfei Date: Wed, 23 Aug 2017 13:10:28 +0800 Message-ID: Subject: Re: How to optimize multiple count( distinct col) in Hive SQL To: user@hive.apache.org Content-Type: multipart/alternative; boundary="f4030438897cfbbbba055764baa4" archived-at: Wed, 23 Aug 2017 05:10:39 -0000 --f4030438897cfbbbba055764baa4 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Hi Gopal, Thanks for all the information and suggestion. The Hive version is 2.0.1 and use Hive-on-MR as the execution engine. I think I should create a intermediate table which includes all the dimensions (including the serval kinds of ids), and then use spark-sql to calculate the distinct values separately (spark sql is really fast so ~~). thanks again. 2017-08-23 12:56 GMT+08:00 Gopal Vijayaraghavan : > > COUNT(DISTINCT monthly_user_id) AS monthly_active_users, > > COUNT(DISTINCT weekly_user_id) AS weekly_active_users, > =E2=80=A6 > > GROUPING_ID() AS gid, > > COUNT(1) AS dummy > > There are two things which prevent Hive from optimize multiple count > distincts. > > Another aggregate like a count(1) or a Grouping sets like a ROLLUP/CUBE. > > The multiple count distincts are rewritten into a ROLLUP internally by th= e > CBO. > > https://issues.apache.org/jira/browse/HIVE-10901 > > A single count distinct + other aggregates (like > min,max,count,count_distinct in 1 pass) is fixed via > > https://issues.apache.org/jira/browse/HIVE-16654 > > There's no optimizer rule to combine both those scenarios. > > https://issues.apache.org/jira/browse/HIVE-15045 > > There's a possibility that you're using Hive-1.x release branch the CBO > doesn't kick in unless column stats are present, but in the Hive-2.x seri= es > you'll notice that some of these optimizations are not driven by a cost > function and are always applied if CBO is enabled. > > > is there any way to rewrite it to optimize the memory usage. > > If you want it to run through very slowly without errors, you can try > disabling all in-memory aggregations. > > set hive.map.aggr=3Dfalse; > > Cheers, > Gopal > > > --=20 =E4=B8=8D=E5=AD=A6=E4=B9=A0=EF=BC=8C=E4=B8=8D=E7=9F=A5=E9=81=93 --f4030438897cfbbbba055764baa4 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hi Gopal, Thanks for all the information and suggestion.
The Hive version is 2.0.1 and use=C2=A0Hive-on-MR as the = execution engine.

I think I should create a interm= ediate table which includes all the dimensions (including the serval kinds = of ids), and then use spark-sql to calculate the distinct values separately= (spark sql is really fast so ~~).

thanks again.

2017-08= -23 12:56 GMT+08:00 Gopal Vijayaraghavan <gopalv@apache.org>= :
> COUNT(DISTINCT mo= nthly_user_id) AS monthly_active_users,
> COUNT(DISTINCT weekly_user_id) AS weekly_active_users,
=E2=80=A6
> GROUPING_ID() AS gid,
> COUNT(1) AS dummy

There are two things which prevent Hive from optimize multiple count= distincts.

Another aggregate like a count(1) or a Grouping sets like a ROLLUP/CUBE.
The multiple count distincts are rewritten into a ROLLUP internally by the = CBO.

https://issues.apache.org/jira/browse/HIVE-1090= 1

A single count distinct + other aggregates (like min,max,count,count_distin= ct in 1 pass) is fixed via

https://issues.apache.org/jira/browse/HIVE-1665= 4

There's no optimizer rule to combine both those scenarios.

https://issues.apache.org/jira/browse/HIVE-1504= 5

There's a possibility that you're using Hive-1.x release branch the= CBO doesn't kick in unless column stats are present, but in the Hive-2= .x series you'll notice that some of these optimizations are not driven= by a cost function and are always applied if CBO is enabled.

> is there any way to rewrite it to optimize the memory usage.

If you want it to run through very slowly without errors, you can tr= y disabling all in-memory aggregations.

set hive.map.aggr=3Dfalse;

Cheers,
Gopal





--
=E4=B8=8D=E5=AD=A6= =E4=B9=A0=EF=BC=8C=E4=B8=8D=E7=9F=A5=E9=81=93
--f4030438897cfbbbba055764baa4--