hive-user mailing list archives

From Namit Jain <>
Subject RE: Set difference in Hive
Date Tue, 30 Jun 2009 00:04:34 GMT
The tables can be large.

For a given key, put the table with the largest number of values for that key as the rightmost table in the join.

The problem only occurs when both tables have keys with a large number of values.
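
To illustrate the ordering advice, here is a minimal sketch using the page_views and users tables from the original question quoted further down. It assumes that page_views has many rows per user while users has one row per name, so page_views is placed last in the join and the outer join is flipped to a RIGHT OUTER JOIN to keep the same result. These assumptions about the data are not stated in the thread.

```sql
-- Sketch only: assumes page_views holds many rows per key and users.name is unique.
-- page_views is written last so that it is the rightmost table in the join.
SELECT pv.user
FROM users u
RIGHT OUTER JOIN page_views pv ON (u.name = pv.user)
WHERE u.name IS NULL;
```

Every row of page_views is kept by the right outer join, and the filter on u.name keeps only the rows that found no matching user. In newer Hive versions the column name user may need to be quoted with backticks.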


From: Rakesh Setty
Sent: Monday, June 29, 2009 4:43 PM
Subject: RE: Set difference in Hive

Thanks very much. But the reducer hangs with this warning:

WARN org.apache.hadoop.hive.ql.exec.JoinOperator: table 0 has more than joinEmitInterval rows for join key []

Both the tables are large and as Zheng mentions at,
large size for table 0 is a problem. Is there any way to overcome this?



From: Peter Skomoroch
Sent: Monday, June 29, 2009 4:20 PM
Subject: Re: Set difference in Hive

Here is an example of what Amr mentioned, taken from one of my Hive scripts. It returns the set of pages
that are not in "daily_pagecounts_table":

SELECT dt.page_id, dt.dates, dt.pageviews, dt.total_pageviews
FROM daily_timelines dt
LEFT OUTER JOIN daily_pagecounts_table dp ON (dt.page_id = dp.page_id)
WHERE dp.page_id IS NULL;

On Mon, Jun 29, 2009 at 7:14 PM, Amr Awadallah wrote:

Do an outer join on user and filter on users.name is null.

-- amr

Rakesh Setty wrote:


I am new to Hive. I would like to know what is the easiest way to get the difference
between two sets. For example, how can I convert the following SQL query to Hive?

select user from page_views where user not in (select name from users);
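
Following the outer-join suggestion in the replies, the query above could be sketched in Hive roughly as follows. This assumes users.name is unique, so the join does not multiply rows of page_views, and that neither column contains NULLs (NOT IN and the outer join treat NULLs differently). The aliases pv and u are introduced here for illustration.

```sql
-- Sketch only: set difference via LEFT OUTER JOIN plus a NULL filter.
-- Rows of page_views with no matching users.name come back with u.name = NULL.
SELECT pv.user
FROM page_views pv
LEFT OUTER JOIN users u ON (pv.user = u.name)
WHERE u.name IS NULL;
```

Like the original query, this returns one row per unmatched page_views row. Use SELECT DISTINCT pv.user if each user should appear only once.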



Peter N. Skomoroch
