Date: Thu, 31 Mar 2011 10:35:55 +0200
From: Dieter Plaetinck
To: common-user@hadoop.apache.org
Subject: Re: # of keys per reducer invocation (streaming api)

On Tue, 29 Mar 2011 23:17:13 +0530 Harsh J wrote:

> Hello,
>
> On Tue, Mar 29, 2011 at 8:25 PM, Dieter Plaetinck wrote:
> > Hi, I'm using the streaming API and I notice my reducer gets - in
> > the same invocation - a bunch of different keys, and I wonder why.
> > I would expect to get one key per reducer run, as with "normal"
> > Hadoop.
> >
> > Is this to limit the number of spawned processes, assuming that
> > creating and destroying processes is usually expensive compared to
> > the amount of work they'll need to do (not much, if you have many
> > keys with each a handful of values)?
> >
> > OTOH if you have a high number of values over a small number of
> > keys, I would rather stick to one-key-per-reducer-invocation, so
> > that I don't need to worry about supporting (and allocating memory
> > for) multiple input keys. Is there a config setting to enable such
> > behavior?
> >
> > Maybe I'm missing something, but this seems like a big difference
> > in comparison to the default way of working, and should maybe be
> > added to the FAQ at
> > http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#Frequently+Asked+Questions
> >
> > thanks,
> > Dieter
> >
>
> I think it would make more sense to think of streaming programs as
> complete map/reduce 'tasks', instead of trying to apply the
> map/reduce functional concept. Both programs have to be written from
> the input-reading level onwards: in the map case each input line is
> a record, and in the reduce case the logical unit is one unique key
> together with all the values grouped under it. One needs to handle
> that reading loop oneself.
>
> Some non-Java libraries that provide abstractions atop the
> streaming/etc. layer allow for more fluent representations of the
> map() and reduce() functions, hiding away the other fine details
> (much like the Java API does). Dumbo[1] is one such library for
> Python Hadoop Map/Reduce programs, for example.
>
> A FAQ entry on this would be good too! You can file a ticket for an
> addition of this observation to the streaming docs' FAQ.
>
> [1] - https://github.com/klbostee/dumbo/wiki/Short-tutorial

Thanks, this makes it a little clearer.
I made a ticket @ https://issues.apache.org/jira/browse/MAPREDUCE-2410

Dieter
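
A minimal sketch of the reading loop described above, assuming the
usual streaming contract of tab-separated key/value lines on stdin,
sorted (and therefore grouped) by key. The per-key work here, summing
integer counts, is only a placeholder and not something from this
thread:

#!/usr/bin/env python
# Sketch of a streaming reducer that handles the key-grouping loop
# itself. Assumptions (not from the thread): input is "key<TAB>value"
# lines on stdin, sorted by key, and values are integer counts to be
# summed per key.
import sys

def emit(key, total):
    # Streaming output is again key<TAB>value lines on stdout.
    sys.stdout.write("%s\t%d\n" % (key, total))

current_key = None
total = 0

for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    if key != current_key:
        if current_key is not None:
            emit(current_key, total)  # key boundary: flush previous key
        current_key, total = key, 0
    total += int(value)

if current_key is not None:
    emit(current_key, total)  # flush the final key

This loop is exactly what a one-key-per-invocation API would otherwise
do on your behalf.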
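
For contrast, a Dumbo-style program hides that loop and calls the
reducer once per key with an iterator of its values. The snippet below
follows the word-count example from the linked short tutorial from
memory, so treat the exact API (dumbo.run and the function signatures)
as an assumption rather than a reference:

# Rough Dumbo-style word count, after the linked short tutorial; the
# dumbo.run() call and function signatures are assumed, not verified.
# The library handles the stdin reading/grouping, so reducer() sees one
# key at a time.
def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)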