hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Matt Pestritto <>
Subject Join Issue with Multiple Reducers
Date Thu, 23 Apr 2009 19:44:07 GMT

I wanted to ask if anyone has seen the following behavior in Hive.

When I execute a cross join ( join with no ON statement) across multiple
reducers, I only get output  = 1/ <number of reducers>.  E.g.  I have 1
million rows in 1 table and 1 row in another table and do a join with no on
statement.  If this is executed with 5 reducers, I get 200k rows out instead
of 1M.  If I change the join to an outer join, I get 1M rows output but only
1/5 of the rows have values from the table in the join.

select a.col1, b.col1 from tbl1 a join tbl2 b ;
tbl1 has 1M records.
tbl2 has 1 record.
There are only 200K records output if run across 5 reducers.

If I change to an outer join :
select a.col1, b.col1 from tbl1 a left outer join tbl2 b ;
There are 1M records output, but only 200K of them have a value for b.col1.
The others have a null.

My workaround is just running with 1 reducer, but it took me a while to
figure this out.


View raw message