hive-user mailing list archives

From Rory Sawyer <rsaw...@mediamath.com>
Subject Re: Cross join/cartesian product explanation
Date Fri, 13 Nov 2015 15:18:05 GMT
Hi Gopal,

Thanks for the detailed response.

It’s really a very simple query that I’m trying to run:
select
    a.a_id,
    b.b_id,
    count(*) as c
from
    table_a a, 
    table_b b
where
    bloom_contains(a_id, b_id_bloom)
group by
    a.a_id,
    b.b_id;

Where “bloom_contains” is a custom UDF. The only changes I made were renaming the tables
and columns. The tables I’m running against are small, roughly 50-100 MB, but this query
would need to be expanded to run on a table that is >100 GB (table_b would likely max out
around 100 MB).
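
To make the intent concrete, here is a minimal Python sketch of what the query computes. It is only an illustration, not the actual job: the set lookup is a stand-in for the real bloom_contains UDF, and the function name cross_join_count and the sample rows are made up. It assumes a_id comes from table_a and b_id_bloom from table_b.

```python
from collections import Counter


def bloom_contains(a_id, b_id_bloom):
    # Stand-in for the custom UDF: a plain set lookup, not a real Bloom filter.
    return a_id in b_id_bloom


def cross_join_count(table_a, table_b):
    # Every row of table_a is tested against every row of table_b,
    # which is why the plan has to be a cartesian product.
    counts = Counter()
    for a_id in table_a:
        for b_id, b_id_bloom in table_b:
            if bloom_contains(a_id, b_id_bloom):
                counts[(a_id, b_id)] += 1  # group by a_id, b_id; count(*)
    return counts


# Made-up sample data: table_a holds ids, table_b holds (b_id, filter) pairs.
table_a = [1, 2, 2, 3]
table_b = [("x", {1, 2}), ("y", {3})]
result = cross_join_count(table_a, table_b)
```

Since the predicate is an arbitrary function of both sides rather than an equality, there is no join key to partition on, so the work grows with rows(table_a) × rows(table_b).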

Any suggestions on how to approach this would be greatly appreciated.

Best,
Rory