hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Premal Shah <premal.j.s...@gmail.com>
Subject Re: Hive query on ORC table is really slow compared to Presto
Date Wed, 21 Jun 2017 17:24:49 GMT
Gopal,
Thanx for the debugging steps.

Here's the output

*hive> select count(1) as collisions, hash(ip) from table group by hash(ip)
order by collisions desc limit 10;*

4       -1432955330
4       -317748560
4       -1460629578
4       1486313154
4       -320519155
4       1875999753
4       -1410139032
4       1596671554
4       503687909
4       989075923


*hive> select count(1) as collisions, hash(id) from table group by hash(id)
order by collisions desc limit 10;*

1711647 -1032220119
1439738 -1316837863
851204  -330948739
838145  535385402
512621  165206418
308968  -36549075
306190  -1568034366
302932  -1386594327
284935  -1991768757
218979  922811836


*hive> select count(1) as collisions, hash(name) from table group by
hash(name) order by collisions desc limit 10;*

1712041 -1906315012
1439738 -1583171535
512630  193448621
340485  2094529
308988  68745436
306240  79997099
289465  -1824055323
218263  1074334059
216464  -466945424


Turning off map side aggregations definitely helped the query on *id . *The
query time went to 1 minute from the earlier 3+ hours.

Based on the output above, both id and name have a lot of collisions, but
the name query was fast earlier too which is interesting.




On Wed, Jun 14, 2017 at 10:34 AM, Gopal Vijayaraghavan <gopalv@apache.org>
wrote:

>
> > SELECT COUNT(DISTINCT ip) FROM table - 71 seconds
> > SELECT COUNT(DISTINCT id) FROM table - 12,399 seconds
>
> Ok, I misunderstood your gist.
>
> > While ip is more unique that id, ip runs many times faster than id.
> >
> > How can I debug this ?
>
> Nearly the same way - just replace "ip" with "id" in my exploratory
> queries.
>
> count(distinct hash(id)) from the table?
>
> count count(1) as collisions, hash(id) from table group by hash(id) order
> by collisions desc limit 10;
>
> And, if those show many collisions
>
> set tez.runtime.io.sort.mb=640;
> set hive.map.aggr=false;
> set tez.runtime.pipelined.shuffle=true; // this reduces failure tolerance
> (i.e retries are more expensive, happy path is faster)
>
> select count(distinct id) from ip_table;
>
> Java's hashCode() implementation is pretty horrible (& Hive defaults to
> using it). If you're seeing a high collision count, I think I might know
> what's happening here.
>
> Cheers,
> Gopal
>
>
>


-- 
Regards,
Premal Shah.

Mime
View raw message