hadoop-common-user mailing list archives

From "Joydeep Sen Sarma" <jssa...@facebook.com>
Subject RE: Combine previous Map Results
Date Mon, 21 Apr 2008 05:37:27 GMT
If performance is not a concern, the second map-reduce task simply has to process both
data sets (the existing intermediate data and the new data). For the existing intermediate
data you want an identity map; for the new data, whatever map logic you have. You can write
a mapper that chooses its map logic based on the input file name (look for the jobconf
variable map.input.file in Java, or the environment variable map_input_file in Hadoop
streaming).
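
A minimal sketch of such a mapper for Hadoop streaming, in Python. Only the
map_input_file environment variable comes from the description above; the
"/intermediate/" path marker, the tab-separated record layout, and the word-count
style logic for the new data are assumptions made purely for illustration.

```python
#!/usr/bin/env python
import os
import sys

# Hypothetical convention: files holding the earlier intermediate output
# live under a path containing this substring.
INTERMEDIATE_MARKER = "/intermediate/"


def is_intermediate(path):
    """Return True if the input file is assumed to hold old intermediate data."""
    return INTERMEDIATE_MARKER in path


def identity_map(line):
    """Pass an existing key<TAB>value record through unchanged."""
    key, _, value = line.partition("\t")
    return [(key, value)]


def new_data_map(line):
    """Placeholder map logic for the new data: emit (word, "1") per word."""
    return [(word, "1") for word in line.split()]


def map_line(line, input_file):
    """Choose the map logic based on the name of the file being read."""
    line = line.rstrip("\n")
    if is_intermediate(input_file):
        return identity_map(line)
    return new_data_map(line)


def main():
    # Hadoop streaming exposes the current input file in this variable.
    input_file = os.environ.get("map_input_file", "")
    for line in sys.stdin:
        for key, value in map_line(line, input_file):
            sys.stdout.write(f"{key}\t{value}\n")


if __name__ == "__main__":
    main()
```

Both kinds of records then reach the same reducer grouped by key, so the reducer
does not need to know which input a value came from.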

If performance is a concern, then re-sorting the existing intermediate data (as happens
in the simple solution) is pointless, since it is already sorted by the desired key. In
that case the only thing available right now (afaik) is the feature described in
HADOOP-2085: you would map-reduce the new data set only, and then join the old and new
data using the map-side joins described in that JIRA. This would require a third
map-reduce task.
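
To illustrate why the join can avoid a re-sort, here is a small Python sketch of
merging two inputs that are each already sorted by key. It only shows the principle;
it is not the HADOOP-2085 API, and the function name and in-memory record layout are
hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_sorted_by_key(old_records, new_records):
    """Yield (key, [values]) from two iterables of (key, value) pairs.

    Each input must already be sorted by key; neither is sorted again.
    """
    # heapq.merge streams through both inputs in key order.
    merged = heapq.merge(old_records, new_records, key=itemgetter(0))
    # Adjacent records with the same key form one group.
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, [value for _, value in group]


if __name__ == "__main__":
    old = [("a", 1), ("c", 5)]   # existing intermediate data, sorted by key
    new = [("a", 2), ("b", 7)]   # output of map-reducing the new data only
    for key, values in merge_sorted_by_key(old, new):
        print(key, values)
```

The merge is a single linear pass over both inputs, which is the saving compared
with pushing the old data through the sort again.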

(One could argue that an option to skip map-side sorting on a per-file basis would be
perfect: one would skip the map-side sort of the old data and sort only the new data,
and the reducer would merge the two.)

-----Original Message-----
From: Dina Said [mailto:dinasaid@gmail.com]
Sent: Sat 4/19/2008 1:55 PM
To: core-user@hadoop.apache.org
Subject: Combine previous Map Results
Dear all

Suppose that I have files that contain intermediate key/value pairs and I want
to combine them with the output of a new MapReduce task. I want this MapReduce
task to combine, during the reduce stage, the intermediate key/value pairs it
generates with the intermediate key/value pairs I already have.

Any ideas?

