hadoop-mapreduce-user mailing list archives

From Denim Live <denim.l...@yahoo.com>
Subject Re: parititioning dataset
Date Tue, 06 Jul 2010 09:37:15 GMT
Yes, it makes sense to do the join on the reduce side, but I want it the other way round, on the map side. One option
is something like this, which someone from Cloudera suggested: "write out all the partition
numbers (one per line) to a
single file, then use the NLineInputFormat to make each line its own map
task. Then in your mapper itself, you will get in a key of "0" or "1" or "2"
etc. Then explicitly open /dataset1/part-(n) and /dataset2/part-(n) in your

This is one option. Any other suggestions are welcome.
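
To make the suggestion concrete, here is a minimal sketch of the work a single map task would do for partition n. It is plain Java against the local filesystem, not Hadoop code: in a real job this logic would sit inside the mapper and open the files through the Hadoop FileSystem API. The class name, the partFile helper, the zero-padded part-0000n naming, and the tab-separated "key<TAB>value" record layout are all assumptions made for illustration. Note also that with NLineInputFormat the line's text (the partition number) should arrive as the map value, while the key is the line's byte offset.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Local-filesystem sketch of a per-partition map-side join.
// Assumptions (not from the thread): tab-separated "key<TAB>value" records,
// zero-padded part file names, and dataset1's partition fits in memory.
class PartitionJoinSketch {

    // Hypothetical naming helper; real part file names depend on the job that wrote them.
    static Path partFile(Path datasetDir, int n) {
        return datasetDir.resolve(String.format("part-%05d", n));
    }

    // What one map task would do for partition n: load dataset1's partition
    // into a hash table, then stream dataset2's partition against it.
    static List<String> joinPartition(Path dataset1, Path dataset2, int n) throws IOException {
        Map<String, List<String>> left = new HashMap<>();
        for (String line : Files.readAllLines(partFile(dataset1, n))) {
            String[] kv = line.split("\t", 2);
            left.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        List<String> joined = new ArrayList<>();
        for (String line : Files.readAllLines(partFile(dataset2, n))) {
            String[] kv = line.split("\t", 2);
            for (String l : left.getOrDefault(kv[0], List.of())) {
                joined.add(kv[0] + "\t" + l + "\t" + kv[1]);
            }
        }
        return joined;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("join-sketch");
        Path d1 = Files.createDirectories(root.resolve("dataset1"));
        Path d2 = Files.createDirectories(root.resolve("dataset2"));
        Files.write(partFile(d1, 0), List.of("a\t1", "b\t2"));
        Files.write(partFile(d2, 0), List.of("a\tx", "c\ty"));
        // The driver file would hold one partition number per line: "0", "1", ...
        for (String partition : List.of("0")) {
            System.out.println(joinPartition(d1, d2, Integer.parseInt(partition)));
        }
    }
}
```

The hash-table approach is only one choice. If both partitions were also sorted by key, a streaming merge would avoid holding either side in memory.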

From: Alex Loddengaard <alex@cloudera.com>
To: mapreduce-user@hadoop.apache.org
Sent: Mon, July 5, 2010 7:16:02 PM
Subject: Re: parititioning dataset

Hi there, 

Unfortunately you can't control which mapper gets what data.  The InputSplit -> map task
assignment is random.  You could, however, do the join in the reduce, by using an intermediate
key as your join key.

Does that make sense?
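
As a rough illustration of the reduce-side join described above, the following plain-Java sketch simulates the two phases in memory: the "map" step tags each record with its source and emits it under the join key, and the "reduce" step pairs up the records that share a key. It does not use the Hadoop API; the class name, the "L"/"R" tags, and the tab-separated record layout are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory sketch of a reduce-side join. The record layout and the
// source tags are assumptions, not anything prescribed by Hadoop.
class ReduceSideJoinSketch {

    // Map phase: group "key<TAB>value" records by join key, tagged by source.
    static void mapPhase(String tag, List<String> records, Map<String, List<String[]>> shuffle) {
        for (String record : records) {
            String[] kv = record.split("\t", 2);
            shuffle.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(new String[] {tag, kv[1]});
        }
    }

    // Reduce phase: for one key, emit every left/right combination.
    static List<String> reduce(String key, List<String[]> tagged) {
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (String[] t : tagged) {
            (t[0].equals("L") ? left : right).add(t[1]);
        }
        List<String> out = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                out.add(key + "\t" + l + "\t" + r);
            }
        }
        return out;
    }

    static List<String> join(List<String> dataset1, List<String> dataset2) {
        // The TreeMap stands in for the framework's sort and shuffle.
        Map<String, List<String[]>> shuffle = new TreeMap<>();
        mapPhase("L", dataset1, shuffle);
        mapPhase("R", dataset2, shuffle);
        List<String> out = new ArrayList<>();
        shuffle.forEach((key, tagged) -> out.addAll(reduce(key, tagged)));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(join(List.of("a\t1", "b\t2"), List.of("a\tx", "b\ty", "c\tz")));
    }
}
```

Keys that appear in only one dataset produce no output here, so this behaves like an inner join.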


On Sat, Jul 3, 2010 at 9:28 AM, Denim Live <denim.live@yahoo.com> wrote:

> Hello everyone,
> I have written my custom partitioner for partitioning datasets. I want to partition two
> datasets using the same partitioner and then, in the next mapreduce job, I want each mapper
> to handle the same partition from the two sources and perform some function such as joining
> etc. How can I ensure that one mapper gets the split that corresponds to the same partition
> from both the sources?
> Any help would be highly appreciated.
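
For context, pairing partition n of one dataset with partition n of the other only works if both partitioning jobs use the same partition function and the same number of partitions. The sketch below shows that property in plain Java. The custom partitioner from the question is not shown in the thread, so the formula here (the one used by Hadoop's default HashPartitioner) is only a stand-in, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: the same partition function applied to two datasets sends equal
// keys to the same partition index, provided numPartitions is also equal.
class SharedPartitionerSketch {

    // Stand-in partition function, same formula as Hadoop's HashPartitioner.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Distribute keys into numPartitions buckets.
    static List<List<String>> partition(List<String> keys, int numPartitions) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (String key : keys) {
            parts.get(partitionFor(key, numPartitions)).add(key);
        }
        return parts;
    }

    public static void main(String[] args) {
        int numPartitions = 4; // must match across both partitioning jobs
        System.out.println(partition(List.of("a", "b", "c"), numPartitions));
        System.out.println(partition(List.of("a", "c", "e"), numPartitions));
    }
}
```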
