hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@maprtech.com>
Subject Re: What is the runtime efficiency of secondary sorting?
Date Mon, 03 Jan 2011 23:12:44 GMT
As a point of order, you would normally use a combiner for this problem, and
you wouldn't sort in either the combiner or the reducer.  Instead, both would
simply scan the values and keep the smallest item seen, emitting it at the
end of the scan.
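To make that concrete, here is a rough Python sketch (not Hadoop code; `keep_min` and the two-stage driver are just illustrative names I made up) of what the combiner and the reducer would each do:

```python
def keep_min(key, values):
    """Combiner/reducer body: a single O(n) scan with O(1) state.

    Emits one (key, min) pair instead of sorting the values.
    """
    smallest = None
    for v in values:
        if smallest is None or v < smallest:
            smallest = v
    return (key, smallest)

# Hypothetical two-stage run: each combiner sees one partition's
# values; the reducer then sees the combiner outputs for the key.
partials = [keep_min("C", [3, 8]), keep_min("C", [2, 1, 6])]
key, result = keep_min("C", [v for _, v in partials])
```

Because min is associative, running the same scan at the combiner and again at the reducer gives the correct global minimum.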

As a point of information, min and max can be computed exactly, and most
other rank-based statistics like the median and quantiles can be
approximated, in an on-line fashion with O(n) time and O(1) storage.
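For example, a single pass can track min and max exactly, and a simple unit-step update rule gives a rough median estimate (this is just one toy scheme, my illustration rather than anything from a particular library):

```python
def streaming_stats(stream):
    """One pass, constant state: exact min/max, rough median estimate.

    The median update nudges the estimate up or down by 1 each step,
    so it drifts toward the value with equal mass on either side.
    """
    lo = hi = None
    med = 0
    for x in stream:
        lo = x if lo is None or x < lo else lo   # exact running min
        hi = x if hi is None or x > hi else hi   # exact running max
        if x > med:                              # approximate median
            med += 1
        elif x < med:
            med -= 1
    return lo, hi, med
```

The approximation quality depends on the data order and scale, but the point stands: none of these statistics needs a sort or more than constant memory.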

Back to the question, though.  Let's assume that you would still like to
sort the elements.  The Hadoop sorts are typically merge sorts, which run
faster when their input arrives partially ordered.  Thus, you should
consider (and probably test) providing a combiner that sorts its partial
results, so that the framework's merge has less work to do.
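As a rough illustration in Python rather than Hadoop itself: if each combiner emits its values pre-sorted, the downstream work is a k-way merge of sorted runs, which costs O(n log k) rather than the O(n log n) of sorting n unsorted values from scratch:

```python
import heapq

# Hypothetical combiner outputs: each partition's values already sorted.
runs = [sorted([2, 5, 7]), sorted([6, 1, 9]), sorted([3, 8])]

# Merging k sorted runs with a heap is O(n log k); the framework's
# merge sort gets the same benefit when its inputs arrive in order.
merged = list(heapq.merge(*runs))
```

With k combiners and n total values, log k is usually much smaller than log n, which is where the speedup comes from.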

Another important consideration is whether the Hadoop shuffle and sort steps
will have to deserialize your data in order to sort it.  If so, you will
almost certainly be better off doing the sort yourself.  I don't know if
your combiner would be subject to the round-trip serialization cost, but I
wouldn't be surprised.

On Mon, Jan 3, 2011 at 2:59 PM, W.P. McNeill <billmcn@gmail.com> wrote:

> Say I have a set of unordered sets of integers:
> A: {2,5,7}
> B: {6,1,9}
> C: {3,8,2,1,6}
> I want to use map/reduce to emit the smallest integer in each set.  If my
> input data looks like this:
> A    2
> A    5
> A    7
> B    6
> B    1
> ...etc...
> I could use an identity mapper and a reducer like the following:
> Reduce(setID, [e0, e1, ... ]):
>    a = sort [e0, e1, ... ]
>    Emit(setID, a[0])
> Using standard sort algorithms, this has runtime efficiency of O(N log N),
> where N is the length of [e0, e1, ... ].
> I can write a custom partitioner and grouper to do a secondary sort
> <http://sonerbalkir.blogspot.com/2010/01/simulating-secondary-sort-on-values.html>,
> so that Hadoop sees to it that [e0, e1, ... ] comes into my reducer already
> sorted.  When I do this my reducer becomes simply:
> Reduce(setID, [e0, e1, ... ]):
>    Emit(setID, e0)
> I understand that this makes things faster because I'm parallelizing the
> sort work, but how much faster?  Specifically, is my runtime efficiency
> now O(1), amortized O(1), some function of the cluster size, or still
> O(N log N) but with smaller constant factors?
> I think (but am not 100% sure) that this is equivalent to the question,
> "What is the runtime efficiency of map-reduce sort"?
> Also, is there an academic paper with this information that I could cite?
> Usual Google searches and manual perusals were fruitless.  Thanks for your
> help.
