hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sarfraz Ramay <sarfraz.ra...@gmail.com>
Subject Re: Efficient Equality Joins of Large Tables
Date Mon, 09 Jun 2014 18:46:58 GMT
May be UDF could solve your problem.

Regards,
Sarfraz Rasheed Ramay (DIT)
Dublin, Ireland.


On Mon, Jun 9, 2014 at 7:30 PM, Mark Desnoyer <desnoyer@neon-lab.com> wrote:

> Hi Furcy,
>
> Thanks for the reply. I looked at MapJoin but it won't do what I need
> because all the tables will be large and actually, explicitly going through
> the entire table in an n^2 fashion is very inefficient.
>
> I have large tables, but the intersection is very small. In the Ad Click
> case, I would have two streams of data: one for loads and one for clicks.
> When a page is rendered, a unique reference id is generated, which gets
> tagged to the load and any click that might happen. So, if I had two
> tables, one for the loads and one for the clicks, the reference id in each
> table could be used to link the click that happened to its associated load.
> In other words, for a given reference id, there is probably only one load
> and a small number of clicks. So, my join would look some like:
>
> SELECT <some cols> FROM loads LEFT JOIN clicks ON loads.reference_id =
> clicks.reference_id;
>
> In the MapReduce framework, I could use that reference id as the map
> output key in order to collect those events that are associated with each
> other and then the reducer would generate the rows by doing the join with
> the small number of rows that it received. Technically, you could take this
> approach for any equality join because you're using the partitioning step
> in order to enforce the join predicate instead of actually evaluating it.
> This could be much more efficient in some cases, so I was wondering if Hive
> can do it and if so, how can I trigger it?
>
> Cheers,
> Mark
>
>
> On Mon, Jun 9, 2014 at 10:34 AM, Furcy Pin <furcy.pin@flaminem.com> wrote:
>
>> Hi Mark,
>>
>> I'm not sure if I understand what your trying to do correctly, do you
>> know the reference id on which you want to do the join beforehand?
>> Or is one of your tables small?
>> Or are they all big with a small intersection?
>>
>> I you haven't yet, I would suggest you to have a look at MapJoin:
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
>> Mapjoins allow hive to perform a join on the map side by putting the
>> small table in cache.
>> (depending on your hive version, this might or might not be enabled by
>> default)
>>
>>  Depending on your use case, you might have to pre-filter some tables,
>> for that you can use sub-queries like:
>>
>> FROM
>> (
>>    SELECT id FROM T WHERE id = myID
>> ) T1
>> JOIN T2 ON T1.id=T2.id
>> SELECT *
>>
>> Unlike a standard sql that would simplifies the query, I believe this
>> will force Hive to perform the sub-select with a first mapreduce, and apply
>> the join in the second
>> (as a mapjoin if the result of the subquery is small enough).
>>
>> Hope this helps,
>>
>> Furcy
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> 2014-06-09 18:48 GMT+02:00 Mark Desnoyer <desnoyer@neon-lab.com>:
>>
>> Hi,
>>>
>>> I was wondering if there was a way in Hive to trigger it to perform an
>>> efficient equality join on large tables? Specifically, I have two or more
>>> tables where the joined key is relatively rare in each table. A good
>>> example would be an AdClick scenario where you would have two tables, one
>>> for ad loads and one for ad clicks, where there is a reference id to
>>> connect a click to a specific ad load and that's the key to join on.
>>>
>>> In the MapReduce framework, this join could be done efficiently by using
>>> the reference id as the key from the map output and then the join would be
>>> done in the reducer since there are a small number of rows associated with
>>> each reference id. However, hive implements join by buffering all but the
>>> last tables in the reducer, and then streams the last table through. This
>>> works great if the first tables are relatively small, but if the tables are
>>> large and the associated rows are rare, it's very inefficient.
>>>
>>> So, can hive perform the join the first way I described? If so, any idea
>>> how I trigger it? Or do I have to write my own MR job?
>>>
>>> Cheers,
>>> Mark
>>>
>>
>>
>

Mime
View raw message