From: panfei <cnweike@gmail.com>
To: user@hive.apache.org
Date: Thu, 24 Aug 2017 09:42:48 +0800
Subject: Re: How to optimize multiple count( distinct col) in Hive SQL
List-Id: user@hive.apache.org (Apache Hive user mailing list)
By decreasing mapreduce.reduce.shuffle.parallelcopies from 20 to 5, it seems that everything goes well — no more OOM.

2017-08-23 17:19 GMT+08:00 panfei:
> The full error stack (described here: https://issues.apache.org/jira/browse/MAPREDUCE-6108) is below.
>
> This error cannot be reproduced every time; after several retries, the job finished successfully.
>
> 2017-08-23 17:16:03,574 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#2
> 	at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.OutOfMemoryError: Java heap space
> 	at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:56)
> 	at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:46)
> 	at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.<init>(InMemoryMapOutput.java:63)
> 	at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:305)
> 	at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:295)
> 	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:514)
> 	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:336)
> 	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
>
> 2017-08-23 17:16:03,577 INFO [main] org.apache.hadoop.mapred.Task: Runnning cleanup for the task
>
> 2017-08-23 13:10 GMT+08:00 panfei:
>> Hi Gopal, thanks for all the information and suggestions.
>>
>> The Hive version is 2.0.1, with Hive-on-MR as the execution engine.
>>
>> I think I should create an intermediate table that includes all the dimensions (including the several kinds of ids), and then use spark-sql to calculate the distinct values separately (Spark SQL is really fast).
>>
>> Thanks again.
>>
>> 2017-08-23 12:56 GMT+08:00 Gopal Vijayaraghavan:
>>> > COUNT(DISTINCT monthly_user_id) AS monthly_active_users,
>>> > COUNT(DISTINCT weekly_user_id) AS weekly_active_users,
>>> …
>>> > GROUPING_ID() AS gid,
>>> > COUNT(1) AS dummy
>>>
>>> There are two things which prevent Hive from optimizing multiple count distincts: another aggregate like a COUNT(1), or grouping sets like a ROLLUP/CUBE.
>>>
>>> The multiple count distincts are rewritten into a ROLLUP internally by the CBO.
>>>
>>> https://issues.apache.org/jira/browse/HIVE-10901
>>>
>>> A single count distinct + other aggregates (like min, max, count, count distinct in one pass) is fixed via
>>>
>>> https://issues.apache.org/jira/browse/HIVE-16654
>>>
>>> There's no optimizer rule to combine both those scenarios.
>>>
>>> https://issues.apache.org/jira/browse/HIVE-15045
>>>
>>> There's a possibility that you're using the Hive-1.x release branch, where the CBO doesn't kick in unless column stats are present; but in the Hive-2.x series you'll notice that some of these optimizations are not driven by a cost function and are always applied if CBO is enabled.
>>>
>>> > is there any way to rewrite it to optimize the memory usage.
>>>
>>> If you want it to run through very slowly without errors, you can try disabling all in-memory aggregations:
>>>
>>> set hive.map.aggr=false;
>>>
>>> Cheers,
>>> Gopal

--
不学习，不知道 ("Without learning, there is no knowing")
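[For readers of the archive: the two settings mentioned in this thread can be applied per-session from the Hive CLI/Beeline. This is a sketch based only on the properties named above; whether they help depends on your reducer heap and workload.]

```sql
-- Session-level settings discussed in this thread (illustrative sketch).
-- Fewer parallel fetchers means fewer map outputs held in reducer
-- memory at once during the shuffle, which is what overflowed here.
SET mapreduce.reduce.shuffle.parallelcopies=5;

-- Gopal's suggestion: disable map-side hash aggregation entirely.
-- Slower (sort-based aggregation), but avoids in-memory hash tables.
SET hive.map.aggr=false;
```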
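[Archive note: panfei's workaround — staging the dimensions once and computing each distinct count in a separate pass — can be sketched roughly as below. Table and column names are hypothetical, not from the thread; the thread itself used Spark SQL for the second step.]

```sql
-- Hypothetical sketch: split multiple COUNT(DISTINCT)s into one pass each.
-- Stage the needed columns once in an intermediate table:
CREATE TABLE tmp_user_activity AS
SELECT dt, monthly_user_id, weekly_user_id
FROM   user_events;

-- Then compute each distinct count separately and join the results,
-- so no single job has to deduplicate several columns at once:
SELECT m.dt, m.monthly_active_users, w.weekly_active_users
FROM  (SELECT dt, COUNT(DISTINCT monthly_user_id) AS monthly_active_users
       FROM tmp_user_activity GROUP BY dt) m
JOIN  (SELECT dt, COUNT(DISTINCT weekly_user_id) AS weekly_active_users
       FROM tmp_user_activity GROUP BY dt) w
ON    m.dt = w.dt;
```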