hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jason Venner <ja...@attributor.com>
Subject Re: MapSide Join and left outer or right outer joins?
Date Thu, 03 Jul 2008 19:06:25 GMT
We are using the default partitioner. I am just about to start verifying 
my result as it took quite a while to work my way through the in-obvious 
issues of hand writing MapFiles, thinks like the key and value class are 
extracted from the jobconf, output key/value.

Question: I looked at the HashPartitioner (which we are using) and a 
key's partition is simply based on the key.hashCode() % 
 How will I get equal keys going to different partitions - clearly I 
have an understanding gap.


Chris Douglas wrote:
> Forgive me if you already know this, but the correctness of the 
> map-side join is very sensitive to partitioning; if your input in 
> sorted but equal keys go to different partitions, your results may be 
> incorrect. Is your input such that the default partitioning is 
> sufficient? Have you verified the correctness of your results? -C
> On Jul 2, 2008, at 9:55 PM, Jason Venner wrote:
>> For the data joins, I let the framework do it - which means one 
>> partition per split - so I have to chose my partition count carefully 
>> to fill the machines.
>> I had an error in my initial outer join mapper, the join map code now 
>> runs about 40x faster than the old brute force read it all shuffle & 
>> sort.
>> Chris Douglas wrote:
>>> Hi Jason-
>>>> It only seems like full outer or full inner joins are supported. I 
>>>> was hoping to just do a left outer join.
>>>> Is this supported or planned?
>>> The full inner/outer joins are examples, really. You can define your 
>>> own operations by extending o.a.h.mapred.join.JoinRecordReader or 
>>> o.a.h.mapred.join.MultiFilterRecordReader and registering your new 
>>> identifier with the parser by defining a property 
>>> "mapred.join.define.<ident>" as your class.
>>> For a left outer join, JoinRecordReader is the correct base. 
>>> InnerJoinRecordReader and OuterJoinRecordReader should make its use 
>>> clear.
>>>> On the flip side doing the Outer Join is about 8x faster than doing 
>>>> a map/reduce over our dataset.
>>> Cool! Out of curiosity, how are you managing your splits? -C
Jason Venner
Attributor - Program the Web <http://www.attributor.com/>
Attributor is hiring Hadoop Wranglers and coding wizards, contact if 

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message