As a point of order, you would normally use a combiner with this problem, and
you wouldn't sort in either the combiner or the reducer. Instead, the combiner
and the reducer would simply scan the values, keep the smallest item seen so
far, and emit it at the end of the scan.
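A minimal sketch of that scan-and-keep-min body (plain Python standing in for
the Hadoop API; the function name is illustrative). Because min is associative
and commutative, the same function can serve as both combiner and reducer:

```python
def keep_min(set_id, values):
    """Combiner/reducer body: one pass over the values, tracking the
    smallest item seen, and emitting it once the scan is done."""
    smallest = None
    for v in values:
        if smallest is None or v < smallest:
            smallest = v
    return (set_id, smallest)

# keep_min("C", [3, 8, 2, 1, 6]) -> ("C", 1)
```

This is O(n) time and O(1) storage per key, with no sort anywhere.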
As a point of information, min and max can be computed exactly, and other
rank-based statistics like the median and quantiles can be approximated, in an
online fashion with O(n) time and O(1) storage.
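As one concrete illustration of the O(1)-storage point, here is a sketch of a
"frugal" streaming median estimator: a single counter that is nudged up or
down by each sample. This is an approximation, not Hadoop code, and the unit
step size is a simplifying assumption:

```python
import random

def frugal_median(stream):
    """Approximate the median of a stream using O(1) storage: nudge the
    estimate up when a sample exceeds it, down when a sample is below it."""
    m = 0
    for x in stream:
        if x > m:
            m += 1
        elif x < m:
            m -= 1
    return m

random.seed(42)
samples = [random.randint(0, 100) for _ in range(100_000)]
est = frugal_median(samples)  # random-walks around the true median (~50)
```

Exact online min/max is even simpler: keep one value and replace it whenever a
smaller (or larger) sample arrives.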
Back to the question, though. Let's assume that you would still like to
sort the elements. The Hadoop sorts are typically merge sorts, which are
helped if you provide data that is already in order. Thus, you should consider
(and probably test) providing a combiner that sorts partial results in order
to make the framework's sorts run faster.
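To see why pre-sorted combiner output helps: merging k already-sorted runs
costs O(N log k) rather than O(N log N) for a full sort of N unsorted items.
A sketch of the effect, with `heapq.merge` standing in for the framework-side
merge (plain Python, not the Hadoop API):

```python
import heapq

# Suppose three combiners each emitted their partial values pre-sorted.
run_a = sorted([2, 5, 7])
run_b = sorted([6, 1, 9])
run_c = sorted([3, 8, 2, 1, 6])

# The framework-side merge then only has to interleave sorted runs,
# an O(N log k) operation for k runs totaling N items.
merged = list(heapq.merge(run_a, run_b, run_c))
```

When k is small relative to N, the log k factor is effectively a constant.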
Another important consideration is whether the Hadoop shuffle and sort steps
will have to deserialize your data in order to sort it. If so, you will
almost certainly be better off doing the sort yourself. I don't know whether
your combiner would be subject to the round-trip serialization cost, but I
wouldn't be surprised.
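One way to sidestep the deserialization cost is to make the serialized form
itself sortable. This is the general idea behind Hadoop's raw comparators,
sketched here in plain Python rather than the actual API: encode keys
fixed-width and big-endian, so that comparing raw bytes gives the same order
as comparing the decoded values.

```python
import struct

def encode_key(n):
    """Fixed-width big-endian encoding of an unsigned int. For such
    encodings, lexicographic byte order coincides with numeric order,
    so the shuffle/sort can compare serialized keys directly without
    deserializing them."""
    return struct.pack(">Q", n)

keys = [7, 2, 5, 100, 1]
by_bytes = sorted(keys, key=encode_key)  # compares raw bytes only
```

Here `by_bytes` comes out in the same order as `sorted(keys)`, even though
only the byte strings were compared.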
On Mon, Jan 3, 2011 at 2:59 PM, W.P. McNeill <billmcn@gmail.com> wrote:
> Say I have a set of unordered sets of integers:
>
> A: {2,5,7}
> B: {6,1,9}
> C: {3,8,2,1,6}
>
> I want to use map/reduce to emit the smallest integer in each set. If my
> input data looks like this:
>
> A 2
> A 5
> A 7
> B 6
> B 1
> ...etc...
>
> I could use an identity mapper and a reducer like the following
>
> Reduce(setID, [e0, e1, ... ]):
> a = sort [e0, e1, ... ]
> Emit(setID, a[0])
>
> Using standard sort algorithms, this has runtime efficiency of O(N log N),
> where N is the length of [e0, e1, ... ].
>
> I can write a custom partitioner and grouper to do a secondary sort
> <http://sonerbalkir.blogspot.com/2010/01/simulatingsecondarysortonvalues.html>,
> so that Hadoop sees to it that [e0, e1, ... ] comes into my reducer already
> sorted. When I do this my reducer becomes simply:
>
> Reduce(setID, [e0, e1, ... ]):
> Emit(setID, e0)
>
> I understand that this makes things faster because I'm parallelizing the
> sort work, but how much faster? Specifically is my runtime efficiency now
> O(1), amortized O(1), some function of the cluster size, or still
> O(N log N) but with smaller constant factors?
>
> I think (but am not 100% sure) that this is equivalent to the question,
> "What is the runtime efficiency of a map/reduce sort?"
>
> Also, is there an academic paper with this information that I could cite?
>
> Usual Google searches and manual perusals were fruitless. Thanks for your
> help.
>
