hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sarfraz Ramay <>
Subject Re: Efficient Equality Joins of Large Tables
Date Mon, 09 Jun 2014 18:46:58 GMT
May be UDF could solve your problem.

Sarfraz Rasheed Ramay (DIT)
Dublin, Ireland.

On Mon, Jun 9, 2014 at 7:30 PM, Mark Desnoyer <> wrote:

> Hi Furcy,
> Thanks for the reply. I looked at MapJoin but it won't do what I need
> because all the tables will be large and actually, explicitly going through
> the entire table in an n^2 fashion is very inefficient.
> I have large tables, but the intersection is very small. In the Ad Click
> case, I would have two streams of data: one for loads and one for clicks.
> When a page is rendered, a unique reference id is generated, which gets
> tagged to the load and any click that might happen. So, if I had two
> tables, one for the loads and one for the clicks, the reference id in each
> table could be used to link the click that happened to its associated load.
> In other words, for a given reference id, there is probably only one load
> and a small number of clicks. So, my join would look some like:
> SELECT <some cols> FROM loads LEFT JOIN clicks ON loads.reference_id =
> clicks.reference_id;
> In the MapReduce framework, I could use that reference id as the map
> output key in order to collect those events that are associated with each
> other and then the reducer would generate the rows by doing the join with
> the small number of rows that it received. Technically, you could take this
> approach for any equality join because you're using the partitioning step
> in order to enforce the join predicate instead of actually evaluating it.
> This could be much more efficient in some cases, so I was wondering if Hive
> can do it and if so, how can I trigger it?
> Cheers,
> Mark
> On Mon, Jun 9, 2014 at 10:34 AM, Furcy Pin <> wrote:
>> Hi Mark,
>> I'm not sure if I understand what your trying to do correctly, do you
>> know the reference id on which you want to do the join beforehand?
>> Or is one of your tables small?
>> Or are they all big with a small intersection?
>> I you haven't yet, I would suggest you to have a look at MapJoin:
>> Mapjoins allow hive to perform a join on the map side by putting the
>> small table in cache.
>> (depending on your hive version, this might or might not be enabled by
>> default)
>>  Depending on your use case, you might have to pre-filter some tables,
>> for that you can use sub-queries like:
>> (
>>    SELECT id FROM T WHERE id = myID
>> ) T1
>> Unlike a standard sql that would simplifies the query, I believe this
>> will force Hive to perform the sub-select with a first mapreduce, and apply
>> the join in the second
>> (as a mapjoin if the result of the subquery is small enough).
>> Hope this helps,
>> Furcy
>> 2014-06-09 18:48 GMT+02:00 Mark Desnoyer <>:
>> Hi,
>>> I was wondering if there was a way in Hive to trigger it to perform an
>>> efficient equality join on large tables? Specifically, I have two or more
>>> tables where the joined key is relatively rare in each table. A good
>>> example would be an AdClick scenario where you would have two tables, one
>>> for ad loads and one for ad clicks, where there is a reference id to
>>> connect a click to a specific ad load and that's the key to join on.
>>> In the MapReduce framework, this join could be done efficiently by using
>>> the reference id as the key from the map output and then the join would be
>>> done in the reducer since there are a small number of rows associated with
>>> each reference id. However, hive implements join by buffering all but the
>>> last tables in the reducer, and then streams the last table through. This
>>> works great if the first tables are relatively small, but if the tables are
>>> large and the associated rows are rare, it's very inefficient.
>>> So, can hive perform the join the first way I described? If so, any idea
>>> how I trigger it? Or do I have to write my own MR job?
>>> Cheers,
>>> Mark

View raw message