hadoop-common-user mailing list archives

From "Joydeep Sen Sarma" <jssa...@facebook.com>
Subject RE: Poly-reduce?
Date Thu, 23 Aug 2007 21:07:10 GMT
Completely agree. We are seeing the same pattern - need a series of
map-reduce jobs for most stuff. There are a few different alternatives
that may help:

1. The output of the intermediate reduce phases can be written to files
that are not replicated. Not sure whether we can do this through
map-reduce - but hdfs seems to be able to set the replication level per
file.

2. Map tasks of the next step are streamed data directly from preceding
reduce tasks. This is more along the lines Ted is suggesting - make
iterative map-reduce a primitive natively supported in Hadoop. This is
probably a better solution - but more work? 

I am sure this has been encountered in other scenarios (heck - I am just
a month into using hadoop) - so would be interested to know what other
people are thinking and whether there are any upcoming features to
support this programming paradigm ..


-----Original Message-----
From: Ted Dunning [mailto:tdunning@veoh.com] 
Sent: Wednesday, August 22, 2007 10:56 AM
To: hadoop-user@lucene.apache.org
Subject: Poly-reduce?

I am finding that it is a common pattern that the multi-phase map-reduce
programs I need to write very often have nearly degenerate map functions
in the second and later map-reduce phases.  The only need for these
functions is to select the next reduce key and, very often, a local
combiner can be used to greatly decrease the number of records passed to
the second reduce.
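
A minimal sketch of the pattern described above, in plain Python rather
than Hadoop code. The names select_key_map and local_combine and the
sample records are hypothetical; the point is only that the second-phase
map does nothing but pick the key, while the combiner shrinks the record
count before the second reduce.

```python
from collections import Counter

def select_key_map(record):
    """Nearly degenerate map: it only selects the next reduce key."""
    pair, count = record
    return pair, count

def local_combine(mapped_records):
    """Local combiner: sum counts per key before anything is shuffled."""
    combined = Counter()
    for key, count in mapped_records:
        combined[key] += count
    return list(combined.items())

# Hypothetical output of a first reduce phase on a single node.
first_reduce_output = [
    (("a", "b"), 1),
    (("a", "b"), 1),
    (("a", "c"), 1),
    (("a", "b"), 1),
]

mapped = [select_key_map(r) for r in first_reduce_output]
combined = local_combine(mapped)
print(len(mapped), "records before combining,", len(combined), "after")
```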

It isn't hard to implement these programs as multiple fully fledged
map-reduces, but it appears to me that many of them would be better
expressed as something more like a map-reduce-reduce program.

For example, take the problem of co-occurrence counting in log records.
The first map would extract a user id and an object id and group on user
id.  The first reduce would take entire sessions for a single user and
generate co-occurrence pairs as keys for the second reduce, each with a
count determined by the frequency of the objects in the user history.
The second reduce (and local combiner) would aggregate these counts and
discard those with small counts.
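
The co-occurrence example can be sketched as an in-memory
map-reduce-reduce in plain Python. This is an illustration under
assumptions, not Hadoop code: the function names, the sample log, the
min_count threshold, and the choice to weight a pair by the product of
the two objects' frequencies in a user's history are all made up for the
sketch. The first reduce feeds the second through a generator, so no
intermediate result is written out.

```python
from collections import Counter, defaultdict
from itertools import combinations

def first_map(log_records):
    """Extract (user id, object id) so records group on user id."""
    for user_id, object_id in log_records:
        yield user_id, object_id

def group_by_key(pairs):
    """Stand-in for the shuffle: collect all values for each key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def first_reduce(user_id, object_ids):
    """Turn one user's history into co-occurrence pairs with counts.

    Assumption: a pair's count is the product of the frequencies of
    its two objects in this user's history.
    """
    freq = Counter(object_ids)
    for a, b in combinations(sorted(freq), 2):
        yield (a, b), freq[a] * freq[b]

def second_reduce(pair_counts, min_count):
    """Aggregate counts per pair and discard pairs with small counts."""
    totals = Counter()
    for pair, count in pair_counts:
        totals[pair] += count
    return {pair: n for pair, n in totals.items() if n >= min_count}

def cooccurrence(log_records, min_count=2):
    sessions = group_by_key(first_map(log_records))
    # Generator: first-reduce output streams into the second reduce.
    pair_counts = (
        pc
        for user_id, object_ids in sessions.items()
        for pc in first_reduce(user_id, object_ids)
    )
    return second_reduce(pair_counts, min_count)

# Hypothetical log of (user id, object id) records.
logs = [
    ("u1", "x"), ("u1", "y"),
    ("u2", "x"), ("u2", "y"), ("u2", "z"),
    ("u3", "x"), ("u3", "x"), ("u3", "z"),
]
result = cooccurrence(logs)
print(result)
```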

Expressed conventionally, this would have to write all of the user sessions
to HDFS and a second map phase would generate the pairs for counting.  The
opportunity for efficiency would come from the ability to avoid writing
intermediate results to the distributed data store.
Has anybody looked at whether this would help and whether it would be
hard to do?
