hadoop-common-user mailing list archives

From Stefan Will <stefan.w...@gmx.net>
Subject Re: Merging reducer outputs into a single part-00000 file
Date Sun, 11 Jan 2009 20:58:20 GMT

As far as I know, the OutputFormat has no effect on the number of output
partitions; that number is determined by the number of reduce tasks.

If you want to sample your output, I'd suggest you write a new MR job whose
mapper uses a random number generator to sample your output files and emits
text key/value pairs, and run it with exactly one reducer and the
TextOutputFormat. You don't even need to supply a reducer class if your
mapper outputs Text/Text key/value pairs.
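
As a rough illustration of the sampling mapper described above, here is a
minimal sketch written as a Hadoop Streaming-style mapper in Python rather
than as a Java Mapper class. The names sample_records and run_mapper, the
0.001 rate, and the tab-separated key/value line layout are assumptions made
for this example, not something taken from the thread. It keeps each record
with a fixed probability, so the result is a random sample of roughly one
record in 1000, not exactly every 1000th record.

```python
import random
import sys


def sample_records(lines, rate, rng):
    """Yield each input line with probability `rate` (independent coin flips)."""
    for line in lines:
        if rng.random() < rate:
            yield line.rstrip("\n")


def run_mapper(stdin, stdout, rate=0.001, rng=None):
    """Read key<TAB>value lines and write the sampled ones unchanged.

    The default rate of 0.001 (about one record per 1000) is an assumption
    taken from the example figure in the question.
    """
    rng = rng or random.Random()
    for record in sample_records(stdin, rate, rng):
        stdout.write(record + "\n")


# When used as a streaming mapper script, the last line would be:
#   run_mapper(sys.stdin, sys.stdout)
```

Run with a single reduce task and no custom reducer, the sampled records
should end up in one output file, sorted by key by the shuffle.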

-- Stefan

> From: Jim Twensky <jim.twensky@gmail.com>
> Reply-To: <core-user@hadoop.apache.org>
> Date: Sun, 11 Jan 2009 01:55:35 -0600
> To: <core-user@hadoop.apache.org>
> Subject: Merging reducer outputs into a single part-00000 file
>
> Hello,
>
> The original map-reduce paper states: "After successful completion, the
> output of the map-reduce execution is available in the R output files (one
> per reduce task, with file names as specified by the user)." However, when
> using Hadoop's TextOutputFormat, all the reducer outputs are combined in a
> single file called part-00000. I was wondering how and when this merging
> process is done. When the reducer calls output.collect(key,value), is this
> record written to a local temporary output file in the reducer's disk and
> then these local files (a total of R) are later merged into one single file
> with a final thread or is it directly written to the final output file
> (part-00000)? I am asking this because I'd like to get an ordered sample of
> the final output data, ie. one record per every 1000 records or something
> similar and I don't want to run a serial process that iterates on the final
> output file.