hadoop-common-user mailing list archives

From abc xyz <fabc_xyz...@yahoo.com>
Subject Re: Partitioned Datasets Map/Reduce
Date Mon, 05 Jul 2010 08:17:07 GMT

Thanks Aaron. The first option sounds good.
How can I write the partition numbers to a single file while each partition
is written to its own file? After the custom partitioner, an identity reducer
would write one part-xxxxx file per partition, but how can all the reducers
write their partition numbers into one shared file? Should I create it
manually?

One possibility: write out all the partition numbers (one per line) to a
single file, then use the NLineInputFormat to make each line its own map
task. Then in your mapper itself, you will get a value of "0" or "1" or "2"
etc. (the key is the line's byte offset in the file). Then explicitly open
/dataset1/part-(n) and /dataset2/part-(n) in your

If you wanted to be more clever, it might be possible to subclass
MultiFileInputFormat to group together both datasets "file-number-wise" when
generating splits, but I don't have specific guidance here.
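
The first option above can be sketched roughly as follows. This is plain
Java using java.nio as a stand-in for the Hadoop FileSystem API, not a
working job. The class and method names (PartitionIndex, writeIndex,
partFileName, inputsFor) are hypothetical, and the part-%05d naming is an
assumption that matches the old-API reducer output names (part-00000,
part-00001, ...). Because the driver already knows how many reduce tasks the
partitioning job used, it can write the index file itself instead of having
the reducers cooperate on it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class PartitionIndex {

    /** Writes "0" .. "numPartitions-1", one number per line, to indexFile. */
    public static void writeIndex(Path indexFile, int numPartitions) throws IOException {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            lines.add(Integer.toString(i));
        }
        Files.write(indexFile, lines, StandardCharsets.UTF_8);
    }

    /** Assumed output name for partition n, e.g. part-00003. */
    public static String partFileName(int n) {
        return String.format(Locale.ROOT, "part-%05d", n);
    }

    /** What one map task would do with its single line of the index file. */
    public static Path[] inputsFor(String indexLine, Path dataset1, Path dataset2) {
        int n = Integer.parseInt(indexLine.trim());
        String name = partFileName(n);
        return new Path[] { dataset1.resolve(name), dataset2.resolve(name) };
    }

    public static void main(String[] args) throws IOException {
        Path index = Files.createTempFile("partitions", ".txt");
        writeIndex(index, 3); // 3 = number of reduce tasks used when partitioning

        // Each line would become its own map task under NLineInputFormat.
        for (String line : Files.readAllLines(index, StandardCharsets.UTF_8)) {
            Path[] pair = inputsFor(line, Paths.get("/dataset1"), Paths.get("/dataset2"));
            System.out.println(line + " -> " + pair[0] + " and " + pair[1]);
        }
    }
}
```

In a real job the mapper would open the two paths through the Hadoop
FileSystem and do the join there; only the index-file and file-name logic
is shown here.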

- Aaron

On Sat, Jul 3, 2010 at 9:35 AM, abc xyz <fabc_xyz111@yahoo.com> wrote:

> Hello everyone,
> I have written a custom partitioner for partitioning datasets. I want to
> partition two datasets using the same partitioner, and then in the next
> MapReduce job I want each mapper to handle the same partition from both
> sources and perform some operation such as a join. How can I ensure that
> one mapper gets the splits that correspond to the same partition from both
> sources?
> Any help would be highly appreciated.
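
A minimal sketch of why this setup works, assuming both datasets are
partitioned with the same function and the same number of partitions: a key
that appears in both datasets then lands in the same partition number in
each. The class and method names (CoPartitionSketch, partitionFor,
partition) are hypothetical. The formula mirrors the one used by Hadoop's
default HashPartitioner and stands in for the custom partitioner, which is
not shown in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CoPartitionSketch {

    /** Same formula as Hadoop's default HashPartitioner. */
    public static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Groups keys into numPartitions buckets, like one partitioning job. */
    public static List<List<String>> partition(List<String> keys, int numPartitions) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (String key : keys) {
            parts.get(partitionFor(key, numPartitions)).add(key);
        }
        return parts;
    }

    public static void main(String[] args) {
        int n = 4; // must be identical for both datasets
        List<List<String>> a = partition(Arrays.asList("u1", "u2", "u3"), n);
        List<List<String>> b = partition(Arrays.asList("u3", "u1", "u9"), n);

        // Partition p of dataset A only needs to be joined with partition p of B.
        for (int p = 0; p < n; p++) {
            System.out.println(p + ": " + a.get(p) + " | " + b.get(p));
        }
    }
}
```

If the two partitioning jobs run with different numbers of reduce tasks,
this guarantee no longer holds, so the reducer count has to be fixed for
both.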


From: Aaron Kimball <aaron@cloudera.com>
To: common-user@hadoop.apache.org
Sent: Mon, July 5, 2010 8:51:44 AM
Subject: Re: Partitioned Datasets Map/Reduce
