Mailing-List: contact hadoop-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hadoop-dev@lucene.apache.org
Received-SPF: neutral (herse.apache.org: local policy)
DomainKey-Signature: a=rsa-sha1; s=serpent; d=yahoo-inc.com; c=nofws; q=dns;
	h=mime-version:in-reply-to:references:content-type:message-id:
	content-transfer-encoding:from:subject:date:to:x-mailer;
	b=hKBwltE71D1Q82tKIelfBn9PG+bHVmQy8HgLjPuG86XnbrmF1rsYvdZ2k3Q4VmKJ
Mime-Version: 1.0 (Apple Message framework v624)
In-Reply-To: <e1fe19820701241732q79b51184gca3651fd25c65828@mail.gmail.com>
References: <45B4A63F.2060006@getopt.org>
 <e1fe19820701241048i3b703c78r77dfadf3eea14532@mail.gmail.com>
 <45B7AD93.2050905@getopt.org>
 <e1fe19820701241732q79b51184gca3651fd25c65828@mail.gmail.com>
Content-Type: text/plain; charset=US-ASCII; format=flowed
Message-Id: <fe756ac54f674a549c16bc46ccb721b6@yahoo-inc.com>
Content-Transfer-Encoding: 7bit
From: Arkady Borkovsky <arkady@yahoo-inc.com>
Subject: Re: What do people use Hadoop for?
Date: Thu, 25 Jan 2007 10:08:41 -0800
To: hadoop-dev@lucene.apache.org

"Disabling the sort"  == "map without reduce" == "map writes the output 
into DFS"
is indeed a very useful and desirable feature.
File a JIRA issue.


On Jan 24, 2007, at 5:32 PM, Doug Judd wrote:

> After digging into this a bit, it looks like the use of 
> IdentityReducer does
> not disable the sort.  I wrote a simple Map/Reduce program that uses
> /usr/share/dict/words as input and generates keys that are a Text
> representation of the CRC of the word modulo 65536 and values that are 
> the
> word itself.  I set the reducer to be the IdentityReducer and the 
> output
> came out sorted:
>
> 0       apperceptively
> 0       Connarus
> 1       overfold
> 1       derationalization
> 1       gymnasium
> 10      respecting
> 10      supperwards
> 100     cellulofibrous
> 100     drogherman
> 100     heteroptics
> 1000    bacao
> 1000    Cumaean
> 1000    didymate
> 1000    disbelieving
> 1001    polymer
> 1001    salveline
> 1001    workwomanly
> 1002    sporty
> 1002    bakal
> 1003    preferentialist
>
> Also, after reviewing the Google paper, they make no mention of the 
> sort
> being disabled by the Identity reducer.  In fact, they describe their 
> Sort
> implementation as using the identity reducer.
>
> Unless I'm missing something, I retract my previous statement.  
> Map-Reduce
> is really just distributed sort.  I do think that being able to 
> disable the
> sort is a much needed enhancement, especially since quite a few 
> applications
> don't need it.
>
> - Doug
>
> On 1/24/07, Andrzej Bialecki <ab@getopt.org> wrote:
>>
>> Doug Judd wrote:
>> > Part of the problem is that calling the paradigm "Map-Reduce" is
>> somewhat
>> > misleading.  It is really just a distributed sort.  The sort is 
>> where
>> > all of
>> > the complexity comes from.  Invoking map() over the input is O(n),
>> > invoking
>> > reduce() over the intermediate results is O(n) as well.  The sort is
>> > O(nlogn).  A more appropriate name for this algorithm would be
>> > "Distributed
>> > Sort with a Pre-map Phase and a Post-reduce Phase"  Calling it
>> Map-Reduce
>> > and leaving out the word "sort" (the most important part) is a 
>> source of
>> > confusion.
>> >
>> > If you think of it in these terms, I think it's easier to see where
>> > and how
>> > it applies.
>>
>> :) Sure, that's one point of view on this - however, in quite a few
>> applications sort is definitely less important than the ability to 
>> split
>> the processing load in map() and reduce() over many machines. 
>> Sometimes
>> I don't care about the sorting at all (in all cases where
>> IdentityReducer is used).
>>
>> --
>> Best regards,
>> Andrzej Bialecki     <><
>> ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
>>
>>