hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gabi D <gabi...@gmail.com>
Subject Huge join performance issue
Date Thu, 04 Apr 2013 12:23:05 GMT
Hi all,
I have two tables I need to join and then summarize.
They are both huge (about 1B rows each, in the relevant partitions) and the
query runs for over 2 hours creating 5T intermediate data.

The current query looks like this:

select t1.b,t1.c,t2.d,t2.e, count(*)
from (select a,b,c    from table baseTB1 where ... ) t1  -- filter by
partition as well
  join
        (select a,d,e from baseTB2 where ...) t2    -- filter by partition
as well
on t1.a=t2.a
group by t1.b,t1.c,t2.d,t2.e


two questions:
1. would joining baseTB1 and baseTB2 directly (instead of subqueries) be
better in any way?
          (I know subqueries cause a lot of writes of the intermediate data
but we also understand it's best to filter down the data that is being
joined, which is "more" correct?)
2. can I use 'distribute by ' and/or 'sort by' in some way that would help
this? my understanding at the moment is that the problem lies in the fact
that the reduces are on column a while the group by is on column b ...

Any thoughts would be appreciated.

Mime
View raw message