hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Devopam Mittra <devo...@gmail.com>
Subject Re: Tuning Triangle Joins on Hive
Date Fri, 01 Aug 2014 03:40:50 GMT
Please try the following approach and let me know if you are not getting
better performance:

1. Ensure indexes are present on dst , rsc columns in the respective tables.
2. Create a subset first taking r2 and r2 (i.e.: r3.src > r2.src) in a
physical table, and then create index on its new src column as well
3. Join this to r1

If this approach works well, then try out the WITH SELECT ... using the
same approach , just no physical intermediate table will be created.

Hope it helps..

regards
Dev




On Fri, Aug 1, 2014 at 12:58 AM, Firas Abuzaid <fabuzaid@stanford.edu>
wrote:

> Hi,
>
> We're running various "triangle" join queries on Hive 0.9.0, and we're
> wondering if we can get any better performance. Here's the query we're
> running:
>
> SELECT count(*)
> FROM table r1 JOIN table r2 ON (r1.dst = r2.src) JOIN table r3 ON (r2.dst
> = r3.src AND r3.dst = r1.src)
> WHERE r1.src < r2.src AND r2.src < r3.src;
>
> We're currently passing the following tuning parameters as well:
>
> set mapred.map.tasks=120;
> set mapred.reduce.tasks=120;
> set mapred.tasktracker.map.tasks.maximum=8;
> set mapred.tasktracker.reduce.tasks.maximum=8;
> set mapred.child.java.opts=-Xmx5120m;
>
> The dataset we're using has 5 million nodes and 70 million edges, and most
> of our time is spent on garbage collection. We have about 30 machines in
> our cluster, and each machine has 45GB of RAM. Any thoughts on how we can
> improve performance? Thanks in advance!
>



-- 
Devopam Mittra
Life and Relations are not binary

Mime
View raw message