Date: Thu, 31 Mar 2011 10:35:55 +0200
From: Dieter Plaetinck
To: common-user@hadoop.apache.org
Subject: Re: # of keys per reducer invocation (streaming api)

On Tue, 29 Mar 2011 23:17:13 +0530 Harsh J wrote:

> Hello,
>
> On Tue, Mar 29, 2011 at 8:25 PM, Dieter Plaetinck wrote:
> > Hi, I'm using the streaming API and I notice my reducer gets - in
> > the same invocation - a bunch of different keys, and I wonder why.
> > I would expect to get one key per reducer run, as with "normal"
> > Hadoop.
> >
> > Is this to limit the number of spawned processes, assuming that
> > creating and destroying processes is usually expensive compared to
> > the amount of work they'll need to do (not much, if you have many
> > keys with each a handful of values)?
> >
> > OTOH if you have a high number of values over a small number of
> > keys, I would rather stick to one-key-per-reducer-invocation, so
> > that I don't need to worry about supporting (and allocating memory
> > for) multiple input keys. Is there a config setting to enable such
> > behavior?
> >
> > Maybe I'm missing something, but this seems like a big difference
> > in comparison to the default way of working, and should maybe be
> > added to the FAQ at
> > http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#Frequently+Asked+Questions
> >
> > thanks,
> > Dieter
> >
>
> I think it would make more sense to think of streaming programs as
> complete map/reduce 'tasks', instead of trying to apply the
> map/reduce functional concept. Both programs have to be written from
> the input-reading level onwards: in the map case each input line is
> a record, and in the reduce case the logical unit is one unique key
> together with all the values grouped under it. One needs to handle
> that reading loop oneself.
>
> Some non-Java libraries that provide abstractions atop the
> streaming/etc. layer allow for more fluent representations of the
> map() and reduce() functions, hiding away the other fine details
> (much like the Java API does). Dumbo[1] is one such library for
> Python Hadoop Map/Reduce programs, for example.
>
> A FAQ entry on this would be good too! You can file a ticket for an
> addition of this observation to the streaming docs' FAQ.
>
> [1] - https://github.com/klbostee/dumbo/wiki/Short-tutorial

Thanks, this makes it a little clearer.
I made a ticket @ https://issues.apache.org/jira/browse/MAPREDUCE-2410

Dieter
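
A minimal sketch of the reading loop described above, assuming the
usual streaming contract of tab-separated key/value lines on stdin,
sorted (and therefore grouped) by key. The per-key work here, summing
integer counts, is only a placeholder and not something from this
thread:

#!/usr/bin/env python
# Sketch of a streaming reducer that handles the key-grouping loop
# itself. Assumptions (not from the thread): input is "key<TAB>value"
# lines on stdin, sorted by key, and values are integer counts to be
# summed per key.
import sys

def emit(key, total):
    # Streaming output is again key<TAB>value lines on stdout.
    sys.stdout.write("%s\t%d\n" % (key, total))

current_key = None
total = 0

for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    if key != current_key:
        if current_key is not None:
            emit(current_key, total)  # key boundary: flush previous key
        current_key, total = key, 0
    total += int(value)

if current_key is not None:
    emit(current_key, total)  # flush the final key

This loop is exactly what a one-key-per-invocation API would otherwise
do on your behalf.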
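
For contrast, a Dumbo-style program hides that loop and calls the
reducer once per key with an iterator of its values. The snippet below
follows the word-count example from the linked short tutorial from
memory, so treat the exact API (dumbo.run and the function signatures)
as an assumption rather than a reference:

# Rough Dumbo-style word count, after the linked short tutorial; the
# dumbo.run() call and function signatures are assumed, not verified.
# The library handles the stdin reading/grouping, so reducer() sees one
# key at a time.
def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)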