hadoop-mapreduce-user mailing list archives

From Ed Mazur <ma...@cs.umass.edu>
Subject Avoiding value buffering in reduce
Date Sat, 16 Jan 2010 22:15:24 GMT
If you don't make the assumption in your reduce function that you can
fit all values for a key in memory, what's the preferred way of
outputting a collection of values? I've been using ArrayWritable, but
this requires you first build up an array of values in memory. This
worked until I ramped up the size of the input and started getting
out-of-memory errors.

IdentityReducer would work, but it seems wasteful to output the key
for each value. Right now I'm doing emit(key, "") for the key and
emit("", value) for each value, but this feels like a hack. It also
makes for additional work to deserialize the output back into key/value
pairs, unlike the (memory-consuming) ArrayWritable approach.
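
For illustration, here is a minimal sketch in plain Java (no Hadoop
classes) of the idea behind the emit(key, "") / emit("", value)
workaround: write the key once, stream each value as the iterator
yields it, and close the group with an end marker so nothing is
buffered. The class name StreamedGroups, its methods, and the
boolean-marker layout are all hypothetical, not a Hadoop API. In a real
job this logic would presumably sit behind a custom output format,
which is an assumption and not something the message confirms.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.EOFException;
import java.io.IOException;
import java.util.Iterator;
import java.util.function.BiConsumer;

// Hypothetical helper: not part of Hadoop, only a sketch of the record layout.
final class StreamedGroups {

    // Writes the key once, then each value as it arrives from the iterator.
    // Only one value is held in memory at a time.
    static void writeGroup(DataOutput out, String key, Iterator<String> values)
            throws IOException {
        out.writeUTF(key);
        while (values.hasNext()) {
            out.writeBoolean(true);      // marker: another value follows
            out.writeUTF(values.next());
        }
        out.writeBoolean(false);         // marker: end of this key's values
    }

    // Reads groups back and hands each (key, value) pair to the callback,
    // again without collecting a key's values into an array.
    static void readGroups(DataInput in, BiConsumer<String, String> sink)
            throws IOException {
        while (true) {
            String key;
            try {
                key = in.readUTF();
            } catch (EOFException endOfStream) {
                return;                  // clean end: no more groups
            }
            while (in.readBoolean()) {
                sink.accept(key, in.readUTF());
            }
        }
    }
}
```

Compared with emitting the key on every record, this layout costs one
extra byte per value and keeps the reader simple, at the price of a
format that only this pair of methods understands.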
