hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Arkady Borkovsky <ark...@yahoo-inc.com>
Subject Re: What do people use Hadoop for?
Date Thu, 25 Jan 2007 18:08:41 GMT
"Disabling the sort"  == "map without reduce" == "map writes the output 
into DFS"
is indeed a very useful and desirable feature.
File a JIRA issue.


On Jan 24, 2007, at 5:32 PM, Doug Judd wrote:

> After digging into this a bit, it looks like the use of 
> IdentityReducer does
> not disable the sort.  I wrote a simple Map/Reduce program that uses
> /usr/share/dict/words as input and generates keys that are a Text
> representation of the CRC of the word modulo 65536 and values that are 
> the
> word itself.  I set the reducer to be the IdentityReducer and the 
> output
> came out sorted:
>
> 0       apperceptively
> 0       Connarus
> 1       overfold
> 1       derationalization
> 1       gymnasium
> 10      respecting
> 10      supperwards
> 100     cellulofibrous
> 100     drogherman
> 100     heteroptics
> 1000    bacao
> 1000    Cumaean
> 1000    didymate
> 1000    disbelieving
> 1001    polymer
> 1001    salveline
> 1001    workwomanly
> 1002    sporty
> 1002    bakal
> 1003    preferentialist
>
> Also, after reviewing the Google paper, they make no mention of the 
> sort
> being disabled by the Identity reducer.  In fact, they describe their 
> Sort
> implementation as using the identity reducer.
>
> Unless I'm missing something, I retract my previous statement.  
> Map-Reduce
> is really just distributed sort.  I do think that being able to 
> disable the
> sort is a much needed enhancement, especially since quite a few 
> applications
> don't need it.
>
> - Doug
>
> On 1/24/07, Andrzej Bialecki <ab@getopt.org> wrote:
>>
>> Doug Judd wrote:
>> > Part of the problem is that calling the paradigm "Map-Reduce" is
>> somewhat
>> > misleading.  It is really just a distributed sort.  The sort is 
>> where
>> > all of
>> > the complexity comes from.  Invoking map() over the input is O(n),
>> > invoking
>> > reduce() over the intermediate results is O(n) as well.  The sort is
>> > O(nlogn).  A more appropriate name for this algorithm would be
>> > "Distributed
>> > Sort with a Pre-map Phase and a Post-reduce Phase"  Calling it
>> Map-Reduce
>> > and leaving out the word "sort" (the most important part) is a 
>> source of
>> > confusion.
>> >
>> > If you think of it in these terms, I think it's easier to see where
>> > and how
>> > it applies.
>>
>> :) Sure, that's one point of view on this - however, in quite a few
>> applications sort is definitely less important than the ability to 
>> split
>> the processing load in map() and reduce() over many machines. 
>> Sometimes
>> I don't care about the sorting at all (in all cases where
>> IdentityReducer is used).
>>
>> --
>> Best regards,
>> Andrzej Bialecki     <><
>> ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
>>
>>


Mime
View raw message