hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gopal Vijayaraghavan <>
Subject Re: Hive query on ORC table is really slow compared to Presto
Date Mon, 12 Jun 2017 23:42:25 GMT

I think this is worth fixing because this seems to be triggered by the data quality itself
- so let me dig in a bit into a couple more scenarios.

> hive.optimize.distinct.rewrite is True by default

FYI, we're tackling the count(1) + count(distinct col) case in the Optimizer now (which came
up after your original email).

> On running the orcfiledump utility, I see that the column on which I want to run the
distinct query is encoded with a DIRECT encoding.  When I run distinct on other columns in
the table that are encoded with the dictionary encoding, the query runs quickly. 

So the cut-off for dictionary encoding is that the value repeats at least ~2x in each stripe
- so very unique patterns won't trigger this.

If the total # of rows of IP == total IP values, I don't expect it to be encoded as a dictionary.

Also interesting detail - I prefer to now store IPs as 2 bigint cols.

bigint ip1, bigint ip2

This was primarily driven by the crazy math required to join different contractions of the
IPv6 formatting.

The two colon contractions are crazy when you want to joins across different data sources,
if you store as a text string. Maybe 2017 is the year of IPv6 :D.


This is something that completely annoys me - CLUSTERED BY does not cluster, but that doesn't
help you here since IP is unique.

You need SORTED BY (ip) to properly generate clusters in Hive.

> Running a count(distinct) query on master id took 3+ hours. It looks like the CPU was
busy when running this query.

Can you do me a favour and run some intermediate state data exploratory queries, because some
part of the slowness is probably triggered due to the failure tolerance checkpoints.

count(distinct hash(ip)) from the table? 

count count(1) as collisions, hash(ip) from table group by hash(ip) order by collisions desc
limit 10;

And, if those show many collisions

set tez.runtime.pipelined.shuffle=true; // this reduces failure tolerance (i.e retries are
more expensive, happy path is faster)

select count(distinct ip) from ip_table;


View raw message