hadoop-mapreduce-user mailing list archives

From Niels Basjes <Ni...@basjes.nl>
Subject Re: Merge sorting reduce output files
Date Wed, 29 Feb 2012 10:59:01 GMT

On Tue, Feb 28, 2012 at 23:28, Robert Evans <evans@yahoo-inc.com> wrote:

>  I am not sure I can help with that unless I know better what “a special
> distribution” means.

The thing is that this application is an "Auto Complete" feature whose
key is "the letters that have been typed so far".
Now for several reasons we need this to be sorted by length of the input.
So the '1 letter suggestions' first, then the '2 letter suggestions' etc.
I've been trying to come up with an automatic partitioning that would split
the dataset into something like 30 parts that when concatenated do what you

> Unless you are doing a massive amount of processing in your reducer having
> a partition that is only close to balancing the distribution is a big win
> over all of the other options that put the data on a single machine and
> sort it there.  Even if you are doing a lot of processing in the reducer,
> or you need a special grouping to make the reduce work properly having a
> second map/reduce job to sort the data that is just close to balancing I
> would suspect would beat out all of the other options.

Thanks, this is a useful suggestion. I'll see if there is a pattern in the
data and from there simply manually define the partitions based on the
pattern we find.
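
A minimal sketch of what such manually defined partitions might look like,
assuming the split is made purely on key length. The class name, the
boundary values and the plain String key are hypothetical; in an actual job
this logic would sit inside getPartition(key, value, numPartitions) of a
class extending org.apache.hadoop.mapreduce.Partitioner.

```java
/**
 * Sketch of a length-based partition function (hypothetical names and bounds).
 * Shorter keys always map to a lower or equal partition number, so the reducer
 * output files can be concatenated in partition order.
 */
class LengthRangePartitioner {
    // Hand-picked inclusive upper bounds on key length, one per partition.
    // Assumed to be in ascending order. Keys longer than the last bound
    // all end up in one extra, final partition.
    private final int[] upperBounds;

    LengthRangePartitioner(int[] upperBounds) {
        this.upperBounds = upperBounds.clone();
    }

    /** Number of partitions (and therefore reducers) this scheme needs. */
    int numPartitions() {
        return upperBounds.length + 1;
    }

    /** Returns the partition for a key, based only on the key's length. */
    int getPartition(String key) {
        int len = key.length();
        for (int i = 0; i < upperBounds.length; i++) {
            if (len <= upperBounds[i]) {
                return i;
            }
        }
        return upperBounds.length; // catch-all for the longest keys
    }

    public static void main(String[] args) {
        // Example bounds only; real ones would come from the pattern in the data.
        LengthRangePartitioner p = new LengthRangePartitioner(new int[] {1, 2, 4});
        for (String key : new String[] {"a", "ab", "abc", "abcd", "abcde"}) {
            System.out.println(key + " -> partition " + p.getPartition(key));
        }
    }
}
```

Note that this only decides which reducer a key goes to. For the
concatenated files to be ordered by length, the keys inside each partition
would also have to be sorted by length first, which this sketch does not
cover.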

Best regards / Met vriendelijke groeten,

Niels Basjes
