From: Arkady Borkovsky
Subject: Re: What do people use Hadoop for?
Date: Thu, 25 Jan 2007 10:08:41 -0800
To: hadoop-dev@lucene.apache.org

"Disabling the sort" == "map without reduce" == "map writes the output
into DFS" is indeed a very useful and desirable feature.
File a JIRA issue.

On Jan 24, 2007, at 5:32 PM, Doug Judd wrote:

> After digging into this a bit, it looks like the use of IdentityReducer
> does not disable the sort. I wrote a simple Map/Reduce program that uses
> /usr/share/dict/words as input and generates keys that are a Text
> representation of the CRC of the word modulo 65536 and values that are
> the word itself. I set the reducer to be the IdentityReducer and the
> output came out sorted:
>
> 0 apperceptively
> 0 Connarus
> 1 overfold
> 1 derationalization
> 1 gymnasium
> 10 respecting
> 10 supperwards
> 100 cellulofibrous
> 100 drogherman
> 100 heteroptics
> 1000 bacao
> 1000 Cumaean
> 1000 didymate
> 1000 disbelieving
> 1001 polymer
> 1001 salveline
> 1001 workwomanly
> 1002 sporty
> 1002 bakal
> 1003 preferentialist
>
> Also, after reviewing the Google paper, they make no mention of the sort
> being disabled by the Identity reducer. In fact, they describe their Sort
> implementation as using the identity reducer.
>
> Unless I'm missing something, I retract my previous statement. Map-Reduce
> is really just distributed sort. I do think that being able to disable
> the sort is a much needed enhancement, especially since quite a few
> applications don't need it.
>
> - Doug
>
> On 1/24/07, Andrzej Bialecki wrote:
>>
>> Doug Judd wrote:
>> > Part of the problem is that calling the paradigm "Map-Reduce" is
>> > somewhat misleading. It is really just a distributed sort. The sort
>> > is where all of the complexity comes from. Invoking map() over the
>> > input is O(n), invoking reduce() over the intermediate results is
>> > O(n) as well. The sort is O(nlogn). A more appropriate name for this
>> > algorithm would be "Distributed Sort with a Pre-map Phase and a
>> > Post-reduce Phase" Calling it Map-Reduce and leaving out the word
>> > "sort" (the most important part) is a source of confusion.
>> >
>> > If you think of it in these terms, I think it's easier to see where
>> > and how it applies.
>>
>> :) Sure, that's one point of view on this - however, in quite a few
>> applications sort is definitely less important than the ability to
>> split the processing load in map() and reduce() over many machines.
>> Sometimes I don't care about the sorting at all (in all cases where
>> IdentityReducer is used).
>>
>> --
>> Best regards,
>> Andrzej Bialecki <><
>> ___. ___ ___ ___ _ _ __________________________________
>> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
>> ___|||__|| \| || | Embedded Unix, System Integration
>> http://www.sigram.com Contact: info at sigram dot com
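
For readers who want to try the experiment Doug Judd describes in the quoted message, here is a minimal local sketch in Python. It is not Hadoop code and does not use any Hadoop API: the function names are hypothetical, zlib.crc32 is only an assumed stand-in for whatever CRC the original program used, and the sort step merely imitates the framework sorting map output by key before reduce. It contrasts "map, sort, identity reduce" with the "map without reduce" behaviour proposed in the reply.

```python
import zlib
from itertools import groupby


def map_word(word):
    # Key: CRC of the word modulo 65536, rendered as text; value: the word.
    # zlib.crc32 is an assumption; the original CRC variant is not stated.
    key = str(zlib.crc32(word.encode("utf-8")) % 65536)
    return key, word


def identity_reduce(key, values):
    # Identity reducer: pass every (key, value) pair through unchanged.
    for value in values:
        yield key, value


def run_with_sort(words):
    # Map, sort the intermediate pairs by their text key, then reduce
    # each key group with the identity reducer.
    pairs = sorted((map_word(w) for w in words), key=lambda kv: kv[0])
    out = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        out.extend(identity_reduce(key, (value for _, value in group)))
    return out


def run_map_only(words):
    # "Map without reduce": emit the map output directly, in input order,
    # with no sort step at all.
    return [map_word(w) for w in words]


if __name__ == "__main__":
    sample = ["apperceptively", "Connarus", "overfold", "gymnasium", "polymer"]
    for key, value in run_with_sort(sample):
        print(key, value)
```

Because the keys are text, they compare as strings, which matches the ordering in the quoted output ("10" before "100" before "1000"). Only the key takes part in the sort in this sketch, so values that share a key stay in input order rather than being sorted themselves.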