hadoop-mapreduce-user mailing list archives

From Robert Evans <ev...@yahoo-inc.com>
Subject Re: Merge sorting reduce output files
Date Wed, 29 Feb 2012 23:07:13 GMT
Niels,

Sorry it has taken me so long to respond.  Today has been a very crazy day.

I am just guessing what your algorithm is for auto-complete.  I really don't know, so I will
just sketch a back-of-the-envelope one myself as a starting point.  My guess is that you have
a few map/reduce jobs.  The first M/R job is mostly a glorified word count that produces each
word or phrase along with how often it is searched for.  In the next job the map splits each
phrase up so that it is output once per prefix, with an ever-increasing number of leading
letters as the key and the original phrase and its weight as the value.  The Reducer/Combiner
groups them by key and produces a top-N list of the phrases with the highest weights for each
key.  If we want the keys to come out in sorted order, we need a sequence file with the
partition keys for the total order partitioner.
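
If it helps make this concrete, here is a rough sketch of what I imagine that second job looks
like.  The class names, the tab-separated "phrase<TAB>count" input format, and the top-10 cutoff
are all assumptions on my part, not anything I know about your code:

import java.io.IOException;
import java.util.Comparator;
import java.util.PriorityQueue;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits every prefix of a phrase as the key, "phrase<TAB>weight" as the value.
public class PrefixMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] parts = line.toString().split("\t");   // assumes "phrase<TAB>count" input
    if (parts.length != 2) {
      return;
    }
    String phrase = parts[0];
    for (int len = 1; len <= phrase.length(); len++) {
      context.write(new Text(phrase.substring(0, len)),
                    new Text(phrase + "\t" + parts[1]));
    }
  }
}

// Keeps only the N highest-weight phrases for each prefix.
public class TopNReducer extends Reducer<Text, Text, Text, Text> {
  private static final int N = 10;   // assumed size of the suggestion list

  @Override
  protected void reduce(Text prefix, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    // Min-heap on weight: once it holds more than N entries, drop the lightest one.
    PriorityQueue<String[]> topN = new PriorityQueue<String[]>(N + 1,
        new Comparator<String[]>() {
          public int compare(String[] a, String[] b) {
            return Long.valueOf(a[1]).compareTo(Long.valueOf(b[1]));
          }
        });
    for (Text value : values) {
      topN.add(value.toString().split("\t"));   // [phrase, weight]
      if (topN.size() > N) {
        topN.poll();
      }
    }
    while (!topN.isEmpty()) {
      String[] entry = topN.poll();   // lowest remaining weight first
      context.write(prefix, new Text(entry[0] + "\t" + entry[1]));
    }
  }
}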

TeraSort generates a partition file by getting the number of splits and then reading the first
N records from each split, where N is based on the number of samples desired and the number of
splits.  The keys from all of the sampled records are sorted and divided into roughly
equal-sized partitions that are stored in the partition file.  This only really works for
TeraSort because it assumes that the input splits already contain more or less random data.
The worst input dataset for TeraSort would be one where each split is sorted internally but
made up of fairly evenly distributed data.  That is exactly what the output of a typical
map/reduce job looks like when the reduce does not change the keys passed in and the output of
each reducer is less than a block in size.  That sure sounds like what wordcount does to me.
The only real way to get around that is to do the sampling as part of a map/reduce job, and to
sample randomly instead of reading the first N records.  It should be a map/reduce job because
it is going to read a lot more data than TeraSort's partition generation code does.  In this
case you would have a second M/R job that runs after the first and randomly samples
words/phrases to work on.  It would then generate the increasingly long prefixes and send them
all to a single reducer, which would buffer them up and, once it has no more input, output
every Nth key so that you get the proper number of partitions for the reducers of the sorting
job.  You could sort these keys yourself to be sure, but they should already arrive in sorted
order, so why bother re-sorting.
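
Here is a rough sketch of what I mean by that sampling job.  The 1% sample rate, the
30-partition target, and the class names are just placeholders; the job would be configured
with a single reducer and a SequenceFile output so the total order partitioner can read the
resulting keys:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Randomly samples phrases and emits their prefixes as candidate partition keys.
public class SampleMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
  private static final double SAMPLE_RATE = 0.01;   // assumed sampling rate
  private final Random random = new Random();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    if (random.nextDouble() > SAMPLE_RATE) {
      return;
    }
    String phrase = line.toString().split("\t")[0];   // assumes "phrase<TAB>count" input
    for (int len = 1; len <= phrase.length(); len++) {
      context.write(new Text(phrase.substring(0, len)), NullWritable.get());
    }
  }
}

// The single reducer buffers every sampled key (they arrive already sorted) and in
// cleanup() writes out every Nth one as a partition boundary.
public class PartitionKeyReducer
    extends Reducer<Text, NullWritable, Text, NullWritable> {
  private static final int NUM_PARTITIONS = 30;   // assumed reducer count of the real job
  private final List<String> keys = new ArrayList<String>();

  @Override
  protected void reduce(Text key, Iterable<NullWritable> values, Context context) {
    keys.add(key.toString());   // duplicate prefixes were already collapsed by the shuffle
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    int step = Math.max(1, keys.size() / NUM_PARTITIONS);
    int written = 0;
    for (int i = step; i < keys.size() && written < NUM_PARTITIONS - 1; i += step) {
      context.write(new Text(keys.get(i)), NullWritable.get());
      written++;
    }
  }
}

Depending on your version, the stock InputSampler.RandomSampler together with
InputSampler.writePartitionFile may also get you most of the way there, but it does its
sampling in the client rather than in a map/reduce job, so it has the same scaling concern I
mentioned above.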

If my assumptions are totally wrong here please let me know.

--Bobby Evans

On 2/29/12 4:59 AM, "Niels Basjes" <Niels@basjes.nl> wrote:

Robert,

On Tue, Feb 28, 2012 at 23:28, Robert Evans <evans@yahoo-inc.com> wrote:
I am not sure I can help with that until I understand better what "a special distribution" means.

The thing is that this application is an "Auto Complete" feature whose key is "the letters
that have been typed so far".
Now, for several reasons, we need this to be sorted by the length of the input: the '1 letter
suggestions' first, then the '2 letter suggestions', and so on.
I've been trying to come up with an automatic partitioning that would split the dataset into
something like 30 parts that, when concatenated, do what you suggest.
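The naive version of what I mean would be a partitioner keyed on the length of the prefix,
something like the sketch below (the class name and the value type are just placeholders).
The problem is of course that this is nowhere near balanced, which is why I'm looking for
something smarter:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each key to a reducer based on how many letters have been typed so far.
public class KeyLengthPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    // Length-1 prefixes go to partition 0, length-2 to partition 1, and so on;
    // anything longer than numPartitions ends up in the last partition.
    int length = key.toString().length();
    return Math.max(0, Math.min(length, numPartitions) - 1);
  }
}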

Unless you are doing a massive amount of processing in your reducer, having a partitioning
that only comes close to balancing the distribution is a big win over all of the other options
that put the data on a single machine and sort it there.  Even if you are doing a lot of
processing in the reducer, or you need a special grouping to make the reduce work properly, I
would suspect that a second map/reduce job that sorts the data using a partitioning that is
merely close to balanced would still beat out all of the other options.

Thanks, this is a useful suggestion. I'll see if there is a pattern in the data and, from
there, simply define the partitions manually based on the pattern we find.
