hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gopal V <gop...@apache.org>
Subject Re: Tuning Triangle Joins on Hive
Date Wed, 06 Aug 2014 21:03:17 GMT
On 7/31/14, 12:28 PM, Firas Abuzaid wrote:

> We're running various "triangle" join queries on Hive 0.9.0, and we're
> wondering if we can get any better performance. Here's the query we're
> running:
>
> SELECT count(*)
> FROM table r1 JOIN table r2 ON (r1.dst = r2.src) JOIN table r3 ON (r2.dst =
> r3.src AND r3.dst = r1.src)
> WHERE r1.src < r2.src AND r2.src < r3.src;


This is an interesting use-case for a JOIN clause.

The main overhead is that the JOIN criteria is materializing JOINs onto 
HDFS.

Neither hive indexes, nor ORC indexes will help you here, unfortunately.

With the latest hive-13, I tried rewriting this to get a performance 
boost with a CTE (which will be streamed through in tez‚ but MR will 
still write it as SequenceFiles to HDFS).

Roughly, you need to remove the implicit filter of r1.src < r2.src && 
r2.src < r3.src into an independent sub query.

explain
with r2_tmp as (
select r2.src as r2_src, r3.dst as r3_dst from r2 join r3 on (r2.dst = 
r3.src) where r2.src < r3.src
)
select count(1) from r1 JOIN r2_tmp on (r1.dst = r2_src and r1.src = r3_dst)
where (r1.src < r2_src)

Vectorization+ORC will massively reduce the GC overhead for this as well 
- but for all these you need a modern hive version.

Data organization can give a bigger performance boost as well, because 
of how ORC will do run-length packing of the same valued columns. I'd 
just sort/bucket on src for all of them.

You can see a similar organization for easier big table JOINs here - 
https://github.com/t3rmin4t0r/nyc-taxi-bigdata/blob/master/ddl/orc.sql#L26

HTH.

Cheers,
Gopal

Mime
View raw message