hadoop-common-dev mailing list archives

From amcn...@mcnabbs.org (Andrew McNabb)
Subject Re: What do people use Hadoop for?
Date Thu, 25 Jan 2007 01:56:08 GMT
On Wed, Jan 24, 2007 at 05:32:20PM -0800, Doug Judd wrote:
> After digging into this a bit, it looks like the use of IdentityReducer does
> not disable the sort.  I wrote a simple Map/Reduce program that uses
> /usr/share/dict/words as input and generates keys that are a Text
> representation of the CRC of the word modulo 65536 and values that are the
> word itself.  I set the reducer to be the IdentityReducer and the output
> came out sorted:

It doesn't disable the sort, but Andrzej's comment still holds:

> >:) Sure, that's one point of view on this - however, in quite a few
> >applications sort is definitely less important than the ability to
> >split the processing load in map() and reduce() over many machines.
> >Sometimes I don't care about the sorting at all (in all cases where
> >IdentityReducer is used).

When you say "MapReduce is just distributed sort," that makes it sound
like people use MapReduce because they want a distributed sort.  The
fact of the matter is that there are plenty of other reasons to use
MapReduce, including load balancing, fault tolerance, etc.  In the
majority of the cases where a sort is needed, it's really just an
implementation detail; if the majority of the work being done is in the
map function, it doesn't make sense to put so much importance on the
sort that follows it.

In general, I don't agree with your statement.  However, if you need to
sort a large number of items, running a MapReduce job with identity map
and identity reduce would be a very simple way to do it.
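
To make that concrete, here is a rough single-process sketch in Python of
what such a job amounts to.  It is not Hadoop code: run_job, identity_map,
and identity_reduce are hypothetical names invented for this illustration,
and the "framework" is modeled as nothing more than an in-memory sort by key
between the map and reduce steps.

```python
from itertools import groupby
from operator import itemgetter


def identity_map(key, value):
    """Emit the input record unchanged (hypothetical stand-in for an identity mapper)."""
    yield key, value


def identity_reduce(key, values):
    """Emit every value for a key unchanged (hypothetical stand-in for an identity reducer)."""
    for value in values:
        yield key, value


def run_job(records, mapper, reducer):
    """Toy model of map -> sort by key -> reduce, all in one process."""
    # Map phase: apply the mapper to every input record.
    mapped = [pair for k, v in records for pair in mapper(k, v)]
    # The sort between the phases; this is the only step doing real work here.
    # Python's sort is stable, so equal keys keep their emission order.
    mapped.sort(key=itemgetter(0))
    # Reduce phase: call the reducer once per distinct key.
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(key, (v for _, v in group)))
    return output


words = ["pear", "apple", "fig", "apple"]
records = [(word, position) for position, word in enumerate(words)]
result = run_job(records, identity_map, identity_reduce)
print(result)
```

Neither user function does anything, yet the output comes back ordered by
key, because the sort sits between the two phases regardless of what the
map and reduce functions are.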

Andrew McNabb
PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868
