hadoop-mapreduce-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Niels Basjes <Ni...@basjes.nl>
Subject Re: Merge sorting reduce output files
Date Thu, 01 Mar 2012 14:23:11 GMT

On Thu, Mar 1, 2012 at 00:07, Robert Evans <evans@yahoo-inc.com> wrote:

>  Sorry it has taken me so long to respond.  Today has been a very crazy
> day.

No worries.

> I am just guessing what your algorithm is for auto-complete.

What we have has a lot more features. Yet the basic idea of what we have is
similar enough to what you describe for this discussion.

>  If we want the keys to come out in sorted order, we need to have a
> sequence file with the partition keys for the total order partitioner.
> TeraSort generates a partition file by getting ....
> This only really works for Terasort because it assumes that all of the
> partitions are more or less random already.

And that is something I don't have.

> This is the case for the output of a typical map/reduce job where the
> reduce does not change the keys passed in and the output of the reducer is
> less then a block in size.  That sure sounds like what wordcount does to
> me.  The only real way to get around that is to do it as part of a
> map/reduce job, and do some random sampling instead of reading the first N.
>  It should be a map/reduce job because it is going to be reading a lot more
> data then TeraSort’s partition generation code.  In this case you would
> have a second M/R job that runs after the first and randomly samples
> words/phrases to work on.  It would then generate the increasing long
> phrases and send them all to a single reducer that would buffer them up,
> and when the Reducer has no more input it would output every Nth key so
> that you get the proper number of partitions for the Reducers.  You could
> sort these keys yourself to be sure, but they should come in in sorted
> order so why bother resorting.
> If my assumptions are totally wrong here please let me know.

I've had a discussion with some coworkers and we came to a possible
solution that is very closely related to your idea.
Because this is a job that runs periodically we think we can assume the
distribution of the dataset will have a similar "shape" from one run to the
If this assumption holds we can:
1) Create a job that takes the output of run 1 and create a aggregate that
can be used to partition the dataset
2) Use the partitioning dataset from '1)' to distribute the processing for
the next run.

Thanks for your suggestions.

Best regards / Met vriendelijke groeten,

Niels Basjes

View raw message