Mailing-List: contact user-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hive.apache.org
MIME-Version: 1.0
In-Reply-To: <D49E51B3-B4E1-44FB-BA71-FEFAAE6F767E@hortonworks.com>
References: <CA+JstLD7wxk6UbOdNDviQf2b=7wCzkeEjGZ4AhZFwC1daoco7A@mail.gmail.com>
 <CA+JstLD-vNvN2VGgdqiqFX1cpOpLsOs_c9dg5pzixCJNT+qqtQ@mail.gmail.com>
 <CA+JstLBNOdCxhvQ3HaZm4vB2-Ov=z6y_C=jRB_BKNHzZL5GhMQ@mail.gmail.com> <D49E51B3-B4E1-44FB-BA71-FEFAAE6F767E@hortonworks.com>
From: panfei <cnweike@gmail.com>
Date: Wed, 23 Aug 2017 13:10:28 +0800
Message-ID: <CA+JstLDHyhnm05zaE-Op57faLBgyQE84zC2Za5COoiROS2LPtQ@mail.gmail.com>
Subject: Re: How to optimize multiple count( distinct col) in Hive SQL
To: user@hive.apache.org
Content-Type: multipart/alternative; boundary="f4030438897cfbbbba055764baa4"
archived-at: Wed, 23 Aug 2017 05:10:39 -0000

--f4030438897cfbbbba055764baa4
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi Gopal, thanks for all the information and suggestions.

The Hive version is 2.0.1, and I use Hive-on-MR as the execution engine.

I think I should create an intermediate table that includes all the
dimensions (including the several kinds of ids), and then use spark-sql to
calculate the distinct values separately (Spark SQL is really fast).
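
A rough sketch of that plan, for illustration only: Python's built-in sqlite3 stands in for Hive and spark-sql so the example runs anywhere, and the table name `user_activity`, the `dt` dimension column, and the helper `distinct_count_by_dim` are hypothetical. Only `monthly_user_id` and `weekly_user_id` come from the original query. The point is the query shape: one COUNT(DISTINCT ...) per statement against the intermediate table.

```python
import sqlite3

# Hypothetical intermediate table: one row per event, with a dimension column
# (dt) and the id columns from the original query. SQLite is only a stand-in
# for the real engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_activity "
    "(dt TEXT, monthly_user_id INTEGER, weekly_user_id INTEGER)"
)
rows = [
    ("2017-08-01", 1, 1),
    ("2017-08-01", 2, 1),
    ("2017-08-01", 2, None),
    ("2017-08-02", 3, 2),
    ("2017-08-02", 3, 2),
]
conn.executemany("INSERT INTO user_activity VALUES (?, ?, ?)", rows)


def distinct_count_by_dim(conn, id_column):
    """Run a single COUNT(DISTINCT ...) per query, grouped by the dimension."""
    # id_column is interpolated into the SQL text, so only accept known names.
    if id_column not in ("monthly_user_id", "weekly_user_id"):
        raise ValueError(f"unexpected column: {id_column}")
    sql = (
        f"SELECT dt, COUNT(DISTINCT {id_column}) "
        "FROM user_activity GROUP BY dt ORDER BY dt"
    )
    return dict(conn.execute(sql).fetchall())


# Each metric is computed by its own query instead of sharing one statement.
monthly_active_users = distinct_count_by_dim(conn, "monthly_user_id")
weekly_active_users = distinct_count_by_dim(conn, "weekly_user_id")
print(monthly_active_users)
print(weekly_active_users)
```

The separate results can then be joined back on the dimension columns if a single output table is needed.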

Thanks again.

2017-08-23 12:56 GMT+08:00 Gopal Vijayaraghavan <gopalv@apache.org>:

> > COUNT(DISTINCT monthly_user_id) AS monthly_active_users,
> > COUNT(DISTINCT weekly_user_id) AS weekly_active_users,
> …
> > GROUPING_ID() AS gid,
> > COUNT(1) AS dummy
>
> There are two things which prevent Hive from optimizing multiple count
> distincts:
>
> another aggregate like a count(1), or a grouping set like a ROLLUP/CUBE.
>
> The multiple count distincts are rewritten into a ROLLUP internally by the
> CBO.
>
> https://issues.apache.org/jira/browse/HIVE-10901
>
> A single count distinct + other aggregates (like
> min,max,count,count_distinct in 1 pass) is fixed via
>
> https://issues.apache.org/jira/browse/HIVE-16654
>
> There's no optimizer rule to combine both those scenarios.
>
> https://issues.apache.org/jira/browse/HIVE-15045
>
> It's possible that you're on the Hive-1.x release branch, where the CBO
> doesn't kick in unless column stats are present, but in the Hive-2.x series
> you'll notice that some of these optimizations are not driven by a cost
> function and are always applied if CBO is enabled.
>
> > is there any way to rewrite it to optimize the memory usage.
>
> If you want it to run through very slowly without errors, you can try
> disabling all in-memory aggregations.
>
> set hive.map.aggr=false;
>
> Cheers,
> Gopal
>
>
>
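
To make the rewrite described in the quoted message concrete, here is a minimal sketch of the general idea, not Hive's actual plan: the data is first reduced to one row per (gid, value), where gid records which column the value came from, and the outer query then counts rows per gid. SQLite has no GROUPING SETS, so a UNION ALL of two GROUP BY queries stands in for it; the table name `t` and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (monthly_user_id INTEGER, weekly_user_id INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, 1), (2, 1), (2, None), (3, 2), (3, 2)],
)

# Direct form: two COUNT(DISTINCT ...) aggregates in a single query.
direct = conn.execute(
    "SELECT COUNT(DISTINCT monthly_user_id), COUNT(DISTINCT weekly_user_id) FROM t"
).fetchone()

# Rewritten form: the inner query deduplicates each column on its own and tags
# the rows with a gid; the outer query needs only plain (non-distinct) counts.
# COUNT(expr) skips NULLs, which matches COUNT(DISTINCT ...) semantics.
rewritten = conn.execute(
    """
    SELECT COUNT(CASE WHEN gid = 1 THEN v END),
           COUNT(CASE WHEN gid = 2 THEN v END)
    FROM (
        SELECT 1 AS gid, monthly_user_id AS v FROM t GROUP BY monthly_user_id
        UNION ALL
        SELECT 2 AS gid, weekly_user_id AS v FROM t GROUP BY weekly_user_id
    )
    """
).fetchone()

print(direct, rewritten)
```

Both forms return the same pair of counts on this sample; the rewritten one avoids holding several distinct sets in memory for the same group at once.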


-- 
If you don't study, you won't know.

--f4030438897cfbbbba055764baa4--