hadoop-common-user mailing list archives

From Jason Venner <ja...@attributor.com>
Subject Re: MapSide Join and left outer or right outer joins?
Date Thu, 03 Jul 2008 20:28:17 GMT
Ahh yes, it is absolutely critical that the partitioning in all of the 
sets be the same :)
We are currently assuming that:
1) Default Partitioner
2) Same Reduce Count
3) Text keys
will guarantee that, but as I mentioned above, we are assuming ;)

Testing is just starting
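
To make the assumption concrete, here is a minimal sketch of what I understand the default HashPartitioner to compute: the key's hash with the sign bit masked off, modulo the reduce count. It is plain Java with no Hadoop classes. PartitionSketch and partitionFor are hypothetical names, and Integer keys stand in for Text keys (Text computes its own hashCode, so the actual partition numbers will differ).

```java
final class PartitionSketch {
    // Same shape as the HashPartitioner formula: masking with Integer.MAX_VALUE
    // clears the sign bit, so the remainder is never negative.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        Integer key = 10; // stand-in for a join key present in both datasets

        // Same key, same reduce count: both datasets put it in the same partition.
        System.out.println("A, 4 reduces: " + partitionFor(key, 4));
        System.out.println("B, 4 reduces: " + partitionFor(key, 4));

        // Same key, different reduce count: the partition changes, which is
        // exactly the case that would break a map-side join.
        System.out.println("B, 3 reduces: " + partitionFor(key, 3));
    }
}
```

The function is deterministic in the key's hash and the reduce count, so equal keys can only land in different partitions if the key class, the partitioner, or the reduce count differs between the jobs that wrote the datasets.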

Chris Douglas wrote:
> Sorry, I meant splits (partitions of input data). If you have n 
> datasets and m splits per dataset, m_i must contain the same keys for 
> all n. So if you're joining two datasets A and B sharing a key k, if 
> split i from A contains any instances of k, (a) split i from A must 
> contain all instances of k from A and (b) split i from B must contain 
> all instances of k from B. -C
>
> On Jul 3, 2008, at 12:06 PM, Jason Venner wrote:
>
>> We are using the default partitioner. I am just about to start 
>> verifying my results, as it took quite a while to work my way through 
>> the non-obvious issues of hand-writing MapFiles, things like the key 
>> and value classes being extracted from the JobConf output key/value 
>> settings.
>>
>> Question: I looked at the HashPartitioner (which we are using) and a 
>> key's partition is simply (key.hashCode() & Integer.MAX_VALUE) % 
>> numReduceTasks.
>> How would I get equal keys going to different partitions? Clearly I 
>> have an understanding gap.
>>
>> Thanks!
>>
>>
>> Chris Douglas wrote:
>>> Forgive me if you already know this, but the correctness of the 
>>> map-side join is very sensitive to partitioning; if your input is 
>>> sorted but equal keys go to different partitions, your results may 
>>> be incorrect. Is your input such that the default partitioning is 
>>> sufficient? Have you verified the correctness of your results? -C
>>>
>>> On Jul 2, 2008, at 9:55 PM, Jason Venner wrote:
>>>
>>>> For the data joins, I let the framework do it, which means one 
>>>> partition per split, so I have to choose my partition count 
>>>> carefully to fill the machines.
>>>>
>>>> I had an error in my initial outer join mapper; the join map code 
>>>> now runs about 40x faster than the old brute-force read-it-all 
>>>> shuffle & sort.
>>>>
>>>> Chris Douglas wrote:
>>>>> Hi Jason-
>>>>>
>>>>>> It seems like only full outer or full inner joins are supported.
>>>>>> I was hoping to just do a left outer join.
>>>>>>
>>>>>> Is this supported or planned?
>>>>>
>>>>>
>>>>> The full inner/outer joins are examples, really. You can define 
>>>>> your own operations by extending 
>>>>> o.a.h.mapred.join.JoinRecordReader or 
>>>>> o.a.h.mapred.join.MultiFilterRecordReader and registering your new 
>>>>> identifier with the parser by defining a property 
>>>>> "mapred.join.define.<ident>" as your class.
>>>>>
>>>>> For a left outer join, JoinRecordReader is the correct base. 
>>>>> InnerJoinRecordReader and OuterJoinRecordReader should make its 
>>>>> use clear.
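
As a rough illustration of the difference between the three join types, here is a plain-Java model of the per-key accept/reject decision. If I read the source correctly, subclasses of JoinRecordReader make this decision in combine(Object[] srcs, TupleWritable dst) by checking dst.has(i); that reading is an assumption on my part. The sketch uses no Hadoop classes: LeftOuterSketch and its methods are hypothetical names, and present[i] stands in for dst.has(i).

```java
final class LeftOuterSketch {
    // Inner join: emit the tuple only if every source has a value for this key.
    static boolean innerCombine(boolean[] present) {
        for (boolean p : present) {
            if (!p) {
                return false;
            }
        }
        return true;
    }

    // Full outer join: emit the tuple whichever sources have the key.
    static boolean outerCombine(boolean[] present) {
        return true;
    }

    // Left outer join: emit the tuple whenever the first (left) source has the
    // key, regardless of the other sources.
    static boolean leftOuterCombine(boolean[] present) {
        return present.length > 0 && present[0];
    }

    public static void main(String[] args) {
        boolean[] leftOnly = {true, false};
        boolean[] rightOnly = {false, true};
        System.out.println("left only:  inner=" + innerCombine(leftOnly)
                + " outer=" + outerCombine(leftOnly)
                + " leftOuter=" + leftOuterCombine(leftOnly));
        System.out.println("right only: inner=" + innerCombine(rightOnly)
                + " outer=" + outerCombine(rightOnly)
                + " leftOuter=" + leftOuterCombine(rightOnly));
    }
}
```

A real implementation would put the left-outer predicate in a JoinRecordReader subclass and register that class under a "mapred.join.define.<ident>" property, as described above.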
>>>>>
>>>>>> On the flip side doing the Outer Join is about 8x faster than 
>>>>>> doing a map/reduce over our dataset.
>>>>>
>>>>> Cool! Out of curiosity, how are you managing your splits? -C
-- 
Jason Venner
Attributor - Program the Web <http://www.attributor.com/>
Attributor is hiring Hadoop Wranglers and coding wizards, contact if 
interested
