hadoop-common-user mailing list archives

From Jason Venner <ja...@attributor.com>
Subject Re: MapSide Join and left outer or right outer joins?
Date Thu, 03 Jul 2008 04:55:35 GMT
For the data joins, I let the framework do it, which means one 
partition per split, so I have to choose my partition count carefully 
to fill the machines.

I had an error in my initial outer join mapper. With that fixed, the 
join map code now runs about 40x faster than the old brute-force 
approach of reading everything through a full shuffle and sort.

Chris Douglas wrote:
> Hi Jason-
>
>> It only seems like full outer or full inner joins are supported. I 
>> was hoping to just do a left outer join.
>>
>> Is this supported or planned?
>
> The full inner/outer joins are examples, really. You can define your 
> own operations by extending o.a.h.mapred.join.JoinRecordReader or 
> o.a.h.mapred.join.MultiFilterRecordReader and registering your new 
> identifier with the parser by defining a property 
> "mapred.join.define.<ident>" as your class.
>
> For a left outer join, JoinRecordReader is the correct base. 
> InnerJoinRecordReader and OuterJoinRecordReader should make its use 
> clear.
>
>> On the flip side doing the Outer Join is about 8x faster than doing a 
>> map/reduce over our dataset.
>
> Cool! Out of curiosity, how are you managing your splits? -C
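
The difference between the join types Chris describes comes down to which sources must hold a given key before a joined record is emitted. The sketch below is a standalone model of that decision only; it does not use the Hadoop classes. The class name JoinPredicates, its method names, and the boolean presence array are all hypothetical, chosen for illustration. In Hadoop itself, the equivalent test would presumably live in the combine method of a JoinRecordReader subclass, checking which positions of the joined tuple are populated, and the subclass would be registered through the "mapred.join.define.<ident>" property mentioned above.

```java
/**
 * Standalone model of join emit rules (hypothetical, not Hadoop API).
 * present[i] is true when source i has a record for the current key.
 * Source 0 is treated as the "left" side of the join.
 */
final class JoinPredicates {

    private JoinPredicates() {
        // Static helpers only.
    }

    /** Inner join: emit only when every source has the key. */
    public static boolean innerJoin(boolean[] present) {
        for (boolean p : present) {
            if (!p) {
                return false;
            }
        }
        return present.length > 0;
    }

    /** Full outer join: emit when at least one source has the key. */
    public static boolean outerJoin(boolean[] present) {
        for (boolean p : present) {
            if (p) {
                return true;
            }
        }
        return false;
    }

    /** Left outer join: emit whenever the left source (index 0) has the key. */
    public static boolean leftOuterJoin(boolean[] present) {
        return present.length > 0 && present[0];
    }
}
```

For a key found only in the left source, leftOuterJoin and outerJoin accept it while innerJoin rejects it. For a key found only in the right source, only outerJoin accepts it.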
