From: panfei <cnweike@gmail.com>
To: user@hive.apache.org
Date: Thu, 24 Aug 2017 09:42:48 +0800
Subject: Re: How to optimize multiple count( distinct col) in Hive SQL
List-Id: user@hive.apache.org (Apache Hive user mailing list)
By decreasing mapreduce.reduce.shuffle.parallelcopies from 20 to 5, it seems that everything goes well — no more OOM.

2017-08-23 17:19 GMT+08:00 panfei:
> The full error stack (described here: https://issues.apache.org/jira/browse/MAPREDUCE-6108) is below.
>
> This error cannot be reproduced every time; after several retries, the job finished successfully.
>
> 2017-08-23 17:16:03,574 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#2
> 	at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.OutOfMemoryError: Java heap space
> 	at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:56)
> 	at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:46)
> 	at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.<init>(InMemoryMapOutput.java:63)
> 	at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:305)
> 	at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:295)
> 	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:514)
> 	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:336)
> 	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
>
> 2017-08-23 17:16:03,577 INFO [main] org.apache.hadoop.mapred.Task: Runnning cleanup for the task
>
> 2017-08-23 13:10 GMT+08:00 panfei:
>> Hi Gopal, thanks for all the information and suggestions.
>>
>> The Hive version is 2.0.1, with Hive-on-MR as the execution engine.
>>
>> I think I should create an intermediate table that includes all the dimensions (including the several kinds of ids), and then use spark-sql to calculate the distinct values separately (Spark SQL is really fast).
>>
>> Thanks again.
>>
>> 2017-08-23 12:56 GMT+08:00 Gopal Vijayaraghavan:
>>> > COUNT(DISTINCT monthly_user_id) AS monthly_active_users,
>>> > COUNT(DISTINCT weekly_user_id) AS weekly_active_users,
>>> …
>>> > GROUPING_ID() AS gid,
>>> > COUNT(1) AS dummy
>>>
>>> There are two things which prevent Hive from optimizing multiple count distincts: another aggregate like a COUNT(1), or grouping sets like a ROLLUP/CUBE.
>>>
>>> The multiple count distincts are rewritten into a ROLLUP internally by the CBO.
>>>
>>> https://issues.apache.org/jira/browse/HIVE-10901
>>>
>>> A single count distinct + other aggregates (like min, max, count, count distinct in one pass) is fixed via
>>>
>>> https://issues.apache.org/jira/browse/HIVE-16654
>>>
>>> There's no optimizer rule to combine both those scenarios.
>>>
>>> https://issues.apache.org/jira/browse/HIVE-15045
>>>
>>> There's a possibility that you're using the Hive-1.x release branch, where the CBO doesn't kick in unless column stats are present; but in the Hive-2.x series you'll notice that some of these optimizations are not driven by a cost function and are always applied if CBO is enabled.
>>>
>>> > is there any way to rewrite it to optimize the memory usage.
>>>
>>> If you want it to run through very slowly without errors, you can try disabling all in-memory aggregations:
>>>
>>> set hive.map.aggr=false;
>>>
>>> Cheers,
>>> Gopal

--
不学习，不知道 ("Without learning, there is no knowing")
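[For readers of the archive: the two settings mentioned in this thread can be applied per-session from the Hive CLI/Beeline. This is a sketch based only on the properties named above; whether they help depends on your reducer heap and workload.]

```sql
-- Session-level settings discussed in this thread (illustrative sketch).
-- Fewer parallel fetchers means fewer map outputs held in reducer
-- memory at once during the shuffle, which is what overflowed here.
SET mapreduce.reduce.shuffle.parallelcopies=5;

-- Gopal's suggestion: disable map-side hash aggregation entirely.
-- Slower (sort-based aggregation), but avoids in-memory hash tables.
SET hive.map.aggr=false;
```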
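[Archive note: panfei's workaround — staging the dimensions once and computing each distinct count in a separate pass — can be sketched roughly as below. Table and column names are hypothetical, not from the thread; the thread itself used Spark SQL for the second step.]

```sql
-- Hypothetical sketch: split multiple COUNT(DISTINCT)s into one pass each.
-- Stage the needed columns once in an intermediate table:
CREATE TABLE tmp_user_activity AS
SELECT dt, monthly_user_id, weekly_user_id
FROM   user_events;

-- Then compute each distinct count separately and join the results,
-- so no single job has to deduplicate several columns at once:
SELECT m.dt, m.monthly_active_users, w.weekly_active_users
FROM  (SELECT dt, COUNT(DISTINCT monthly_user_id) AS monthly_active_users
       FROM tmp_user_activity GROUP BY dt) m
JOIN  (SELECT dt, COUNT(DISTINCT weekly_user_id) AS weekly_active_users
       FROM tmp_user_activity GROUP BY dt) w
ON    m.dt = w.dt;
```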