hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ning Zhang <>
Subject Re: Issues with joining across large tables
Date Mon, 26 Oct 2009 17:47:50 GMT
This query is using the regular join to simulate semijoin. The proposed semijoin syntax is
something like:

select from A left semi join B on =;

Semantically semijoin is the same as the simulated query using JOIN. but proposed semijoin
operation is more efficient in that:
1) the simulation has a group by which requires a separate map-reduce task. In the semijoin,
the group by is map-side only so that we can eliminate one map-reduce task.
2) the semijoin implements the early-exit semantics: whenever a match is found on B, the rest
of the tuples in B will be omitted. While the JOIN will keep matching even though a match
is found.


The difference between the semijoin implementation and the simulation using regular join is
that the group by in the

On Oct 26, 2009, at 9:56 AM, Ryan LeCompte wrote:

I got the query working (my bad syntax caused it to error out)... However, can you please
explain what Hive is doing in your example query for the semijoin?


On Mon, Oct 26, 2009 at 12:52 PM, Ryan LeCompte <<>>
Also, if I try your query I get:

FAILED: Parse Error: line 1:8 cannot recognize input 'DISTINCT' in expression specification

Seems like it doesn't like SELECT (DISTINCT ...)

On Mon, Oct 26, 2009 at 12:47 PM, Ryan LeCompte <<>>
Ah, got it!

Can you explain your semijoin query though? I see that you are using JOIN but I don't see
the ON (...) part. What does Hive do in this case when ON (..) doesn't exist?

On Mon, Oct 26, 2009 at 12:41 PM, Ning Zhang <<>>
I see. If you just want all userid in one table that also appear in another table, semijoin
would help (it is still under implementation though). You can simulate semijoin using JOIN
as follows:

select (distinct ut.userid) from usertracking ut join (select usertrackingid from streamtransfers
group by usertrackingid where usertrackingid is not null and usertrackingid<>0) where
userid is not null and userid <> 0;

BTW, the rewritten query using UNION ALL is not semantically equivalent to the original query:
it doesn't return "common" user IDs in both table, but just the "union" of them.


On Oct 26, 2009, at 9:33 AM, Ryan LeCompte wrote:

Hi Ning,

Basically, I have roughly 60GB of log data that is logically broken up into two Hive tables,
which works out to be those two tables that I had mentioned. Both of these tables share a
common key, but this key appears in virtually every row of each of the two tables. Each of
these tables just has 1 partition (by date). My query is very similar (although with different
data/columns) to Chris Bates' query:

SELECT COUNT(DISTINCT UT.UserID) FROM usertracking UT JOIN streamtransfers ST ON (ST.usertrackingid
= UT.usertrackingid) WHERE UT.UserID IS NOT NULL AND UT.UserID <> 0;

However, in the above example imagine that in both tables the same users appear a lot, so
there are lots of matches.

On Mon, Oct 26, 2009 at 12:27 PM, Ning Zhang <<>>
If it is really a Cartesian product, there is no better way other than increasing the timeout
for the reducers. You can do a back-of-the-envelope calculation on how long it takes (e.g.,
in your log it shows it takes 19 sec. to get 6 million rows out of the join). This calculation
can also be done if it is not a Cartesian product, and you have an good estimate of how many
rows will be produced.

In general, joining two huge tables can be avoided by partitioning the fact tables, so that
you don't need to join the whole table.

BTW, are you joining two fact tables or one dimension table is just huge?


On Oct 26, 2009, at 5:16 AM, Ryan LeCompte wrote:

So it turns out that the JOIN key of my query basically results in a match/join on all rows
of each table! There really is no extra filtering that I can do to exclude invalid rows, etc.
The mappers fly by and complete, but the reducers are just moving extremely slowly (my guess
due to what Zheng said about the Cartesian product of all rows getting matched).

Is there some other way that I could re-write the JOIN or is my only option to increase the
timeout on the task trackers so that they don't timeout/kill the reducers? I've already upped
their timeouts to 30 minutes (as opposed to the default of 10), and it doesn't seem to be
sufficient... Again, this is joining a 33GB table with a 13GB table where join key is shared
by virtually all rows in both tables.


On Mon, Oct 26, 2009 at 7:35 AM, Ryan LeCompte <<>>
Thanks guys, very useful information. I will modify my query a bit and get back to you guys
on whether it worked or not.


On Mon, Oct 26, 2009 at 4:34 AM, Chris Bates <<>>

I asked this question a couple days ago but in a slightly different form.  What you have to
do is make sure the table you're joining is smaller than the leftmost table.  As an example,

SELECT COUNT(DISTINCT UT.UserID) FROM usertracking UT JOIN streamtransfers ST ON (ST.usertrackingid
= UT.usertrackingid) WHERE UT.UserID IS NOT NULL AND UT.UserID <> 0;

In this query, usertracking is a table that is about 8 or 9 gigs.  Streamtransfers is a table
that is about 4 gigs.  As per Zheng's comment, I omitted UserID's of Null or Zero as there
are many rows with this key and the join worked as intended.


PS. As an aside, Hive is proving to be quite useful to all of our database hackers here at
Grooveshark.  Thanks to everyone who has committed...I hope to contribute soon.

On Mon, Oct 26, 2009 at 2:08 AM, Zheng Shao <<>>
It's probably caused by the Cartesian product of many rows from the two tables with the same


On Sun, Oct 25, 2009 at 7:22 PM, Ryan LeCompte <<>>
It also looks like the reducers just never stop outputting things likethe (following  -- see
below), causing them to ultimately time out and get killed by the system.

2009-10-25 22:21:18,879 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 100000000

2009-10-25 22:21:22,009 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 101000000
2009-10-25 22:21:22,010 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 101000000
2009-10-25 22:21:25,141 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 102000000

2009-10-25 22:21:25,142 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 102000000
2009-10-25 22:21:28,263 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 103000000
2009-10-25 22:21:28,263 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 103000000

2009-10-25 22:21:31,387 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 104000000
2009-10-25 22:21:31,387 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 104000000
2009-10-25 22:21:34,510 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 105000000

2009-10-25 22:21:34,510 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 105000000
2009-10-25 22:21:37,633 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 106000000
2009-10-25 22:21:37,633 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 106000000

On Sun, Oct 25, 2009 at 9:39 PM, Ryan LeCompte <<>>
Hello all,

Should I expect to be able to do a Hive JOIN between two tables that have about 10 or 15GB
of data each? What I'm noticing (for a simple JOIN) is that all the map tasks complete, but
the reducers just hang at around 87% or so (for the first set of 4 reducers), and then they
eventually just get killed due to inability to respond by the cluster. I can do a JOIN between
a large table and a very small table of 10 or so records just fine.

Any thoughts?



View raw message