hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From java8964 <>
Subject RE: Efficient Equality Joins of Large Tables
Date Tue, 10 Jun 2014 13:38:10 GMT
I agree that the originally request is not very clear. 
>From my understanding, the reference_id is very unique in both Ad load and Ad click tables,
but both tables could contain huge amount of data. (But in theory, click table should be much
smaller than the load table, right? But let's just assume that click table is also big enough
to not fit into memory).
In this case, you have to use reducer side join. I don't know any other trick can make it
faster. If you have millions (or even billions) of unique reference_ids in both tables, you
just need a lot of reducers, as the reference_id will be your key.
Make sure you give the query all the reducers available in your cluster, and make sure MR
job use them all. The default HashPartitioner should generate enough reducer groups, if your
unique reference_id count is large enough.
If there are other optimize ways, I am happy to hear it.

Date: Tue, 10 Jun 2014 10:16:53 +0200
Subject: Re: Efficient Equality Joins of Large Tables

Hi Mark,
I still don't understand if you are trying to perform your join for one specific known reference_id
or for all of them?In the first case, you should apply a pre-filter on each table first with
subqueries, and this would leave you with only a few rows for the join.

In the second case, you could try applying a GROUP BY on each table first (again with subqueries)
which would leave you with a few rows for the join too.This second method might work or not
depending on what you are trying to compute. 

For complex GROUP BY operations, cleverly combining the UDFs collect_set with split might
help too...
If it does't work for you, then I would need more details on what you are willing to compute


2014-06-09 20:30 GMT+02:00 Mark Desnoyer <>:

Hi Furcy,
Thanks for the reply. I looked at MapJoin but it won't do what I need because all the tables
will be large and actually, explicitly going through the entire table in an n^2 fashion is
very inefficient.

I have large tables, but the intersection is very small. In the Ad Click case, I would have
two streams of data: one for loads and one for clicks. When a page is rendered, a unique reference
id is generated, which gets tagged to the load and any click that might happen. So, if I had
two tables, one for the loads and one for the clicks, the reference id in each table could
be used to link the click that happened to its associated load. In other words, for a given
reference id, there is probably only one load and a small number of clicks. So, my join would
look some like:

SELECT <some cols> FROM loads LEFT JOIN clicks ON loads.reference_id = clicks.reference_id;
In the MapReduce framework, I could use that reference id as the map output key in order to
collect those events that are associated with each other and then the reducer would generate
the rows by doing the join with the small number of rows that it received. Technically, you
could take this approach for any equality join because you're using the partitioning step
in order to enforce the join predicate instead of actually evaluating it. This could be much
more efficient in some cases, so I was wondering if Hive can do it and if so, how can I trigger


On Mon, Jun 9, 2014 at 10:34 AM, Furcy Pin <> wrote:

Hi Mark,
I'm not sure if I understand what your trying to do correctly, do you know the reference id
on which you want to do the join beforehand? 

Or is one of your tables small? 

Or are they all big with a small intersection?
I you haven't yet, I would suggest you to have a look at MapJoin:

Mapjoins allow hive to perform a join on the map side by putting the small table in cache.
(depending on your hive version, this might or might not be enabled by default)

Depending on your use case, you might have to pre-filter some tables, for that you can use
sub-queries like:

FROM (   SELECT id FROM T WHERE id = myID ) T1

Unlike a standard sql that would simplifies the query, I believe this will force Hive to perform
the sub-select with a first mapreduce, and apply the join in the second

(as a mapjoin if the result of the subquery is small enough).
Hope this helps,

2014-06-09 18:48 GMT+02:00 Mark Desnoyer <>:


I was wondering if there was a way in Hive to trigger it to perform an efficient equality
join on large tables? Specifically, I have two or more tables where the joined key is relatively
rare in each table. A good example would be an AdClick scenario where you would have two tables,
one for ad loads and one for ad clicks, where there is a reference id to connect a click to
a specific ad load and that's the key to join on. 

In the MapReduce framework, this join could be done efficiently by using the reference id
as the key from the map output and then the join would be done in the reducer since there
are a small number of rows associated with each reference id. However, hive implements join
by buffering all but the last tables in the reducer, and then streams the last table through.
This works great if the first tables are relatively small, but if the tables are large and
the associated rows are rare, it's very inefficient.

So, can hive perform the join the first way I described? If so, any idea how I trigger it?
Or do I have to write my own MR job?


View raw message