hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Douglas <chri...@yahoo-inc.com>
Subject Re: MapSide Join and left outer or right outer joins?
Date Thu, 03 Jul 2008 20:09:35 GMT
Sorry, I meant splits (partitions of input data). If you have n  
datasets and m splits per dataset, m_i must contain the same keys for  
all n. So if you're joining two datasets A and B sharing a key k, if  
split i from A contains any instances of k, (a) split i from A must  
contain all instances of k from A and (b) split i from B must contain  
all instances of k from B. -C

On Jul 3, 2008, at 12:06 PM, Jason Venner wrote:

> We are using the default partitioner. I am just about to start  
> verifying my result as it took quite a while to work my way through  
> the in-obvious issues of hand writing MapFiles, thinks like the key  
> and value class are extracted from the jobconf, output key/value.
> Question: I looked at the HashPartitioner (which we are using) and a  
> key's partition is simply based on the key.hashCode() %  
> conf.getNumReduces().
> How will I get equal keys going to different partitions - clearly I  
> have an understanding gap.
> Thanks!
> Chris Douglas wrote:
>> Forgive me if you already know this, but the correctness of the map- 
>> side join is very sensitive to partitioning; if your input in  
>> sorted but equal keys go to different partitions, your results may  
>> be incorrect. Is your input such that the default partitioning is  
>> sufficient? Have you verified the correctness of your results? -C
>> On Jul 2, 2008, at 9:55 PM, Jason Venner wrote:
>>> For the data joins, I let the framework do it - which means one  
>>> partition per split - so I have to chose my partition count  
>>> carefully to fill the machines.
>>> I had an error in my initial outer join mapper, the join map code  
>>> now runs about 40x faster than the old brute force read it all  
>>> shuffle & sort.
>>> Chris Douglas wrote:
>>>> Hi Jason-
>>>>> It only seems like full outer or full inner joins are supported.  
>>>>> I was hoping to just do a left outer join.
>>>>> Is this supported or planned?
>>>> The full inner/outer joins are examples, really. You can define  
>>>> your own operations by extending  
>>>> o.a.h.mapred.join.JoinRecordReader or  
>>>> o.a.h.mapred.join.MultiFilterRecordReader and registering your  
>>>> new identifier with the parser by defining a property  
>>>> "mapred.join.define.<ident>" as your class.
>>>> For a left outer join, JoinRecordReader is the correct base.  
>>>> InnerJoinRecordReader and OuterJoinRecordReader should make its  
>>>> use clear.
>>>>> On the flip side doing the Outer Join is about 8x faster than  
>>>>> doing a map/reduce over our dataset.
>>>> Cool! Out of curiosity, how are you managing your splits? -C
> -- 
> Jason Venner
> Attributor - Program the Web <http://www.attributor.com/>
> Attributor is hiring Hadoop Wranglers and coding wizards, contact if  
> interested

View raw message