hadoop-common-user mailing list archives

From "Jim Twensky" <jim.twen...@gmail.com>
Subject Merging reducer outputs into a single part-00000 file
Date Sun, 11 Jan 2009 07:55:35 GMT

The original map-reduce paper states: "After successful completion, the
output of the map-reduce execution is available in the R output files (one
per reduce task, with file names as specified by the user)." However, when
using Hadoop's TextOutputFormat, all the reducer outputs are combined in a
single file called part-00000. I was wondering how and when this merging
process is done. When the reducer calls output.collect(key,value), is this
record written to a local temporary output file on the reducer's disk, with
these local files (a total of R) later merged into one single file by a
final thread, or is it written directly to the final output file
(part-00000)? I am asking because I'd like to get an ordered sample of
the final output data, i.e., one record out of every 1000 records or something
similar, and I don't want to run a serial process that iterates over the final
output file.
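
For concreteness, the kind of ordered sample described above (one record out
of every 1000, in output order) could be sketched as follows. This is plain
Python, not Hadoop API code; the function name sample_every_nth and the
default interval are assumptions made only for illustration. Applied to each
reducer's already-ordered record stream, it would avoid a separate serial pass
over the final file.

```python
from itertools import islice


def sample_every_nth(records, n=1000):
    """Lazily yield the n-th, 2n-th, 3n-th, ... record of an ordered iterable.

    Hypothetical helper for illustration: `records` stands in for the
    ordered stream of records a single reducer would emit.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # islice skips the first n - 1 records, then takes every n-th one,
    # so the relative order of the input is preserved in the sample.
    return islice(records, n - 1, None, n)


if __name__ == "__main__":
    # Stand-in for 5000 ordered output records.
    ordered_output = range(1, 5001)
    print(list(sample_every_nth(ordered_output, 1000)))
```

Because the helper is lazy, it keeps only one record in memory at a time, so
it can run alongside whatever writes the records rather than after the fact.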
