hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Doug Judd" <d...@zvents.com>
Subject Re: What do people use Hadoop for?
Date Thu, 25 Jan 2007 21:44:39 GMT
Comments inline ...

On 1/25/07, Bryan A. P. Pendleton <bp@geekdom.net> wrote:
> If the output is already sorted, the sort pass *should* be able to run in
> linear time - perhaps not worth optimizing it out for cases of sorted
> output.

Agreed.  There's no reason why the framework can't detect this and
automatically skip the n log(n) sort.

Given the scatter/reassemble nature of the default map/reduce
> (scatter by Partition, by default by the Hash), inputs that are sorted may
> not be written as such to output..... so, if you're counting on sorted
> data,
> maybe it's best to leave the sort in (and verify that the current
> infrastructure will perform well given sorted input). Otherwise, if there
> is
> no implication/need of sorted output, then sort can be totally disabled.

I guess when I say "sorted" I really mean aggregated.  If you know the input
is "aggregated" (e.g. it's the output of a previous Map/Reduce job) and the
map() function preserves this aggregation, then the step of building the
intermediate results from the map output should be done in linear time.

I do feel that this optimization is important.  As Adrezej points out, there
are quite a few applications that could benefit from this.  In particular,
the join example.  Without it, you can't really justify replacing your large
RDBMS system with a Map-Reduce (Bigtable) implementation.

 - Doug

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message