hadoop-common-user mailing list archives

From Dieter Plaetinck <dieter.plaeti...@intec.ugent.be>
Subject Re: # of keys per reducer invocation (streaming api)
Date Thu, 31 Mar 2011 08:35:55 GMT
On Tue, 29 Mar 2011 23:17:13 +0530
Harsh J <qwertymaniac@gmail.com> wrote:

> Hello,
> 
> On Tue, Mar 29, 2011 at 8:25 PM, Dieter Plaetinck
> <dieter.plaetinck@intec.ugent.be> wrote:
> > Hi, I'm using the streaming API and I notice my reducer gets - in
> > the same invocation - a bunch of different keys, and I wonder why.
> > I would expect to get one key per reducer run, as with the "normal"
> > hadoop.
> >
> > Is this to limit the amount of spawned processes, assuming creating
> > and destroying processes is usually expensive compared to the
> > amount of work they'll need to do (not much, if you have many keys
> > with each a handful of values)?
> >
> > OTOH if you have a high number of values over a small number of
> > keys, I would rather stick to one-key-per-reducer-invocation, then
> > I don't need to worry about supporting (and allocating memory for)
> > multiple input keys.  Is there a config setting to enable such
> > behavior?
> >
> > Maybe I'm missing something, but this seems like a big difference in
> > comparison to the default way of working, and should maybe be added
> > to the FAQ at
> > http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#Frequently+Asked+Questions
> >
> > thanks,
> > Dieter
> >
> 
> I think it makes more sense to think of streaming programs as
> complete map/reduce 'tasks' rather than as the map() and reduce()
> functions of the functional model. Both programs have to handle
> their own input reading: the mapper reads one record per line,
> while the reducer reads a sorted stream of key/value lines and must
> detect the boundaries between key groups itself, in its own
> reading loop.
> 
> Some non-Java libraries that provide abstractions atop the
> streaming/etc. layer allow for more fluent representations of the
> map() and reduce() functions, hiding away the other fine details (like
> the Java API). Dumbo[1] is such a library for Python Hadoop Map/Reduce
> programs, for example.
> 
> A FAQ entry on this would be a good addition too! You can file a
> ticket to have this observation added to the streaming docs' FAQ.
> 
> [1] - https://github.com/klbostee/dumbo/wiki/Short-tutorial
> 

Thanks,
this makes it a little clearer.
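For the record, a reducer that handles the grouping loop itself might look
like the sketch below. It assumes the usual streaming contract (sorted,
tab-separated "key\tvalue" lines on stdin) and sums the values per key;
the function name and the summing logic are illustrative, not something
from this thread:

```python
# Sketch of a streaming-style reducer that does its own key grouping.
# Input lines are assumed sorted by key, in "key\tvalue" form.

def reduce_lines(lines):
    """Yield (key, sum_of_values) for each key group in a sorted stream."""
    current_key, total = None, 0
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current_key:
            # Key changed: emit the finished group, start a new one.
            if current_key is not None:
                yield current_key, total
            current_key, total = key, 0
        total += int(value)
    # Emit the final group, if any input was seen.
    if current_key is not None:
        yield current_key, total

# In an actual streaming job this would be driven by stdin, e.g.:
#   import sys
#   for key, total in reduce_lines(sys.stdin):
#       print("%s\t%d" % (key, total))
```

Python's itertools.groupby can hide the same boundary-detection loop, which
is essentially what libraries like Dumbo do for you.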
I made a ticket @ https://issues.apache.org/jira/browse/MAPREDUCE-2410

Dieter
