hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gopal Vijayaraghavan <>
Subject Re: Hive query on ORC table is really slow compared to Presto
Date Wed, 14 Jun 2017 17:34:46 GMT

> SELECT COUNT(DISTINCT ip) FROM table - 71 seconds
> SELECT COUNT(DISTINCT id) FROM table - 12,399 seconds

Ok, I misunderstood your gist.

> While ip is more unique that id, ip runs many times faster than id.
> How can I debug this ?

Nearly the same way - just replace "ip" with "id" in my exploratory queries.
count(distinct hash(id)) from the table?

count count(1) as collisions, hash(id) from table group by hash(id) order by collisions desc
limit 10;

And, if those show many collisions

set tez.runtime.pipelined.shuffle=true; // this reduces failure tolerance (i.e retries are
more expensive, happy path is faster)

select count(distinct id) from ip_table;

Java's hashCode() implementation is pretty horrible (& Hive defaults to using it). If
you're seeing a high collision count, I think I might know what's happening here.


View raw message