hadoop-common-user mailing list archives

From jason hadoop <jason.had...@gmail.com>
Subject Re: Map-side join: Sort order preserved?
Date Thu, 14 May 2009 15:25:05 GMT
Sort order is preserved as long as your Mapper doesn't change the key
ordering in its output. The partition name is not preserved: the map task
that reads part-00000 is not guaranteed to write part-00000.

What I have done is to manually work out what the partition number of the
output file should be for each map task, by calling the partitioner on an
input key, and then renaming the output in the close method.
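
A minimal sketch of the "work out the partition number" step, in plain Java with no Hadoop dependency. Everything here is hypothetical and for illustration only: the class name, the split points, and the sample key are made up, and the lookup is a stand-in for calling your real partitioner, assuming a total-order style partitioner defined by sorted split points. The rename itself, done in the close method, is not shown.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, not part of Hadoop: works out which part-NNNNN name
// a map task's output should carry, given one key from its input.
class PartitionNameSketch {

    // Split-point lookup in the style of a total-order partitioner.
    // Keys below splitPoints[0] map to partition 0; keys at or above
    // splitPoints[i] (and below the next split point) map to partition i + 1.
    static int partitionFor(String key, String[] sortedSplitPoints) {
        int pos = Arrays.binarySearch(sortedSplitPoints, key) + 1;
        return pos < 0 ? -pos : pos;
    }

    // Formats the partition number in the usual part-00000 style.
    static String partName(int partition) {
        return String.format(Locale.ROOT, "part-%05d", partition);
    }

    public static void main(String[] args) {
        // Assumed split points: three boundaries define four partitions.
        String[] splitPoints = {"g", "n", "t"};
        // Assumed to be any key read by this map task; since one task
        // handles one whole partition, every key gives the same answer.
        String anyInputKey = "kiwi";
        int partition = partitionFor(anyInputKey, splitPoints);
        System.out.println(partName(partition)); // prints part-00001
    }
}
```

In a real job you would call the same partitioner class that produced the input files, so that the computed number matches the input partition exactly.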

Conceptually the place for this dance is in the OutputCommitter, but I
haven't used OutputCommitters in production code, and my map-side join
examples come from before they were available.

The Hadoop join framework handles setting the split size to Long.MAX_VALUE
for you.

If you put up a discussion question on www.prohadoopbook.com, I will fill in
the example on how to do this.

On Thu, May 14, 2009 at 8:04 AM, Stuart White <stuart.white1@gmail.com> wrote:

> I'm implementing a map-side join as described in chapter 8 of "Pro
> Hadoop".  I have two files that have been partitioned using the
> TotalOrderPartitioner on the same key into the same number of
> partitions.  I've set mapred.min.split.size to Long.MAX_VALUE so that
> one Mapper will handle an entire partition.
> I want the output to be written in the same partitioned, total sort
> order.  If possible, I want to accomplish this by setting my
> NumReducers to 0 and having the output of my Mappers written directly
> to HDFS, thereby skipping the partition/sort step.
> My question is this: Am I guaranteed that the Mapper that processes
> part-00000 will have its output written to the output file named
> part-00000, the Mapper that processes part-00001 will have its output
> written to part-00001, etc... ?
> If so, then I can preserve the partitioning/sort order of my input
> files without re-partitioning and re-sorting.
> Thanks.

Alpha chapters of my book on Hadoop are available at
www.prohadoopbook.com, a community for Hadoop professionals
