hadoop-common-user mailing list archives

From Chris Douglas <chri...@yahoo-inc.com>
Subject Re: MapSide Join and left outer or right outer joins?
Date Thu, 03 Jul 2008 18:56:54 GMT
Forgive me if you already know this, but the correctness of the
map-side join is very sensitive to partitioning; if your input is
sorted but equal keys go to different partitions, your results may be
incorrect. Is your input such that the default partitioning is
sufficient? Have you verified the correctness of your results? -C
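A minimal sketch of the partitioning concern, in plain Java with no Hadoop dependency. The class PartitionCheck and its methods are hypothetical names used only for illustration. It assumes keys are plain strings and uses the same formula as the default hash partitioner (hash code with the sign bit masked, modulo the partition count). The join here only ever compares partition i of one input with partition i of the other, so inputs that were partitioned differently silently lose matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not part of Hadoop: shows why both inputs of a
// per-partition join must send equal keys to the same partition number.
final class PartitionCheck {

    // Same formula as the default hash partitioner, applied to String keys:
    // mask the sign bit, then take the remainder by the partition count.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Splits the keys into numPartitions buckets; each bucket is kept sorted.
    static List<TreeSet<String>> partition(List<String> keys, int numPartitions) {
        List<TreeSet<String>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new TreeSet<>());
        }
        for (String k : keys) {
            parts.get(partitionFor(k, numPartitions)).add(k);
        }
        return parts;
    }

    // Counts the keys matched when partition i of the left input is joined
    // only against partition i of the right input.
    static int matchesPerPartition(List<TreeSet<String>> left,
                                   List<TreeSet<String>> right) {
        int n = Math.min(left.size(), right.size());
        int matches = 0;
        for (int i = 0; i < n; i++) {
            for (String k : left.get(i)) {
                if (right.get(i).contains(k)) {
                    matches++;
                }
            }
        }
        return matches;
    }
}
```

With the keys a, b, c, d on both sides, partitioning both inputs into 2 buckets finds all 4 matches, while partitioning one side into 2 and the other into 3 finds only 1, even though every bucket is sorted.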

On Jul 2, 2008, at 9:55 PM, Jason Venner wrote:

> For the data joins, I let the framework do it - which means one  
> partition per split - so I have to choose my partition count  
> carefully to fill the machines.
>
> I had an error in my initial outer join mapper; the join map code  
> now runs about 40x faster than the old brute-force read-it-all  
> shuffle & sort.
>
> Chris Douglas wrote:
>> Hi Jason-
>>
>>> It only seems like full outer or full inner joins are supported. I  
>>> was hoping to just do a left outer join.
>>>
>>> Is this supported or planned?
>>
>>
>> The full inner/outer joins are examples, really. You can define  
>> your own operations by extending o.a.h.mapred.join.JoinRecordReader  
>> or o.a.h.mapred.join.MultiFilterRecordReader and registering your  
>> new identifier with the parser by defining a property  
>> "mapred.join.define.<ident>" as your class.
>>
>> For a left outer join, JoinRecordReader is the correct base.  
>> InnerJoinRecordReader and OuterJoinRecordReader should make its use  
>> clear.
>>
>>> On the flip side doing the Outer Join is about 8x faster than  
>>> doing a map/reduce over our dataset.
>>
>> Cool! Out of curiosity, how are you managing your splits? -C
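The difference between the inner, outer, and left outer joins discussed in the quoted reply comes down to a per-key decision about whether to emit a row. The sketch below illustrates that decision in plain Java for two sources; it does not use the o.a.h.mapred.join classes, and LeftOuterJoinSketch and its methods are hypothetical names. A real implementation would make the equivalent decision inside a JoinRecordReader subclass registered through "mapred.join.define.<ident>".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration, not Hadoop code: the emit-or-skip rule that
// distinguishes inner, full outer, and left outer joins over two sources.
final class LeftOuterJoinSketch {

    // Inner join: emit only when both sources have the key.
    static boolean innerEmits(boolean hasLeft, boolean hasRight) {
        return hasLeft && hasRight;
    }

    // Full outer join: emit when either source has the key.
    static boolean outerEmits(boolean hasLeft, boolean hasRight) {
        return hasLeft || hasRight;
    }

    // Left outer join: emit whenever the left source has the key.
    static boolean leftOuterEmits(boolean hasLeft, boolean hasRight) {
        return hasLeft;
    }

    // Joins two key/value maps in key order, writing "null" for a missing
    // right-hand value. Rows are tab-separated: key, left value, right value.
    static List<String> leftOuterJoin(Map<String, String> left,
                                      Map<String, String> right) {
        TreeSet<String> keys = new TreeSet<>(left.keySet());
        keys.addAll(right.keySet());
        List<String> out = new ArrayList<>();
        for (String k : keys) {
            boolean hasLeft = left.containsKey(k);
            boolean hasRight = right.containsKey(k);
            if (leftOuterEmits(hasLeft, hasRight)) {
                String rightValue = hasRight ? right.get(k) : "null";
                out.add(k + "\t" + left.get(k) + "\t" + rightValue);
            }
        }
        return out;
    }
}
```

Keys present only in the right source are dropped, keys present only in the left source are kept with a placeholder, which is the only behavioral difference from the full outer case.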

