Mailing-List: contact user-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hive.apache.org
User-Agent: Microsoft-MacOutlook/f.19.0.160817
Date: Mon, 12 Jun 2017 16:42:25 -0700
Subject: Re: Hive query on ORC table is really slow compared to Presto
From: Gopal Vijayaraghavan <gopalv@apache.org>
Sender: Gopal Vijayaraghavan <gopal@hortonworks.com>
To: "user@hive.apache.org" <user@hive.apache.org>
CC: Premal Shah <premal.j.shah@gmail.com>
Message-ID: <92978D5B-6EB7-4269-BDE5-21051EDD2EE6@hortonworks.com>
Thread-Topic: Hive query on ORC table is really slow compared to Presto
References: <CAH72Ak8oTvrZDanUJkYv1nKAagrBRGb=DyYLvtU283MB68=pCw@mail.gmail.com>
 <D8266584-8568-4313-8E31-3E574196DDDE@hortonworks.com>
 <CAH72Ak-KhFTPpo7T94ibLkOfft8AykGDqpVGAvKkoEGRV9D+AA@mail.gmail.com>
In-Reply-To: <CAH72Ak-KhFTPpo7T94ibLkOfft8AykGDqpVGAvKkoEGRV9D+AA@mail.gmail.com>
Mime-version: 1.0
Content-type: text/plain;
	charset="UTF-8"
Content-transfer-encoding: quoted-printable
archived-at: Mon, 12 Jun 2017 23:42:47 -0000

Hi,

I think this is worth fixing because this seems to be triggered by the data
quality itself - so let me dig in a bit into a couple more scenarios.

> hive.optimize.distinct.rewrite is True by default

FYI, we're tackling the count(1) + count(distinct col) case in the Optimizer
now (which came up after your original email).

https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.optimize.countdistinct

> On running the orcfiledump utility, I see that the column on which I want
> to run the distinct query is encoded with a DIRECT encoding. When I run
> distinct on other columns in the table that are encoded with the dictionary
> encoding, the query runs quickly.

So the cut-off for dictionary encoding is that the value repeats at least ~2x
in each stripe - so very unique patterns won't trigger this.

If the total # of rows of IP == total IP values, I don't expect it to be
encoded as a dictionary.
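
To make that rule of thumb concrete, here is a rough sketch (not the ORC writer's actual logic) that estimates whether a sample of column values repeats often enough to be worth a dictionary. The function names and the 0.5 cut-off are assumptions for illustration; 0.5 just mirrors the "~2x" figure above, and the real cut-off is a writer setting.

```python
def distinct_ratio(values):
    """Fraction of non-null values that are distinct (1.0 means all unique)."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    return len(set(non_null)) / len(non_null)


def likely_dictionary_encoded(values, max_ratio=0.5):
    # Assumption: max_ratio=0.5 stands in for "each value repeats ~2x".
    # The real ORC writer decides per stripe using its own configuration.
    return distinct_ratio(values) <= max_ratio


# Hypothetical samples: one all-unique column, one with heavy repetition.
unique_ips = ["10.0.0.%d" % i for i in range(100)]
repeated_ips = ["10.0.0.%d" % (i % 10) for i in range(100)]

print(likely_dictionary_encoded(unique_ips))    # False
print(likely_dictionary_encoded(repeated_ips))  # True
```

A column where every row holds a different IP gives a ratio of 1.0, which is why it ends up DIRECT-encoded.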

Also interesting detail - I prefer to now store IPs as 2 bigint cols.

bigint ip1, bigint ip2

This was primarily driven by the crazy math required to join different
contractions of the IPv6 formatting.

The two-colon contractions are crazy when you want to join across different
data sources, if you store as a text string. Maybe 2017 is the year of
IPv6 :D.
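
As an illustration of the two-bigint idea, here is a minimal sketch using Python's standard ipaddress module to turn any textual form of an address into two signed 64-bit values that fit the ip1/ip2 columns. The helper names are hypothetical, and mapping IPv4 addresses into the ::ffff:a.b.c.d range is an assumption of this sketch, not something prescribed above.

```python
import ipaddress


def to_signed64(u):
    """Reinterpret an unsigned 64-bit value as signed, since bigint is signed."""
    return u - (1 << 64) if u >= (1 << 63) else u


def ip_to_bigints(text):
    """Return (ip1, ip2): the high and low 64 bits of the address."""
    addr = ipaddress.ip_address(text)
    if addr.version == 4:
        # Assumption: store IPv4 as an IPv4-mapped IPv6 address (::ffff:a.b.c.d).
        value = (0xFFFF << 32) | int(addr)
    else:
        value = int(addr)
    hi = value >> 64
    lo = value & ((1 << 64) - 1)
    return to_signed64(hi), to_signed64(lo)


# The contracted and fully expanded spellings map to the same pair.
print(ip_to_bigints("2001:db8::1"))
print(ip_to_bigints("2001:0db8:0000:0000:0000:0000:0000:0001"))
```

Once every source is normalised this way, the join is a plain equality on two integer columns instead of a comparison of differently contracted strings.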

> CLUSTERED BY (ip) INTO 16 BUCKETS

This is something that completely annoys me - CLUSTERED BY does not cluster,
but that doesn't help you here since IP is unique.

You need SORTED BY (ip) to properly generate clusters in Hive.

> Running a count(distinct) query on master id took 3+ hours. It looks like
> the CPU was busy when running this query.

Can you do me a favour and run some intermediate state data exploratory
queries, because some part of the slowness is probably triggered due to the
failure tolerance checkpoints.

select count(distinct hash(ip)) from ip_table;

select count(1) as collisions, hash(ip) from ip_table group by hash(ip)
order by collisions desc limit 10;

And, if those show many collisions:

set tez.runtime.io.sort.mb=640;
set hive.map.aggr=false;
-- this reduces failure tolerance (i.e. retries are more expensive, happy path is faster)
set tez.runtime.pipelined.shuffle=true;

select count(distinct ip) from ip_table;

Cheers,
Gopal