hadoop-common-user mailing list archives

From "Rasit OZDAS" <rasitoz...@gmail.com>
Subject Re: Merging reducer outputs into a single part-00000 file
Date Wed, 14 Jan 2009 08:46:04 GMT

As far as I know, nothing further is done with the output after the Reducer.
At first glance, the situation looks to me like every task is producing the same key.
This can happen in one of the following cases:
- the input format reads the same keys for every task.
- the mapper collects every incoming key-value pair under the same key.
- the reducer does the same.

But if you have a little experience, you already know this.
An ordered list implies one final file, or am I missing something?
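
To illustrate the cases above, here is a rough sketch in plain Java (no Hadoop
dependency). It assumes the job uses Hadoop's default HashPartitioner, whose
formula is reproduced in partitionFor. The class name PartitionSketch, the
method names, and the sample keys are made up for illustration. It counts how
many records would be routed to each reduce task when every record carries the
same key, compared with distinct keys.

```java
import java.util.Map;
import java.util.TreeMap;

public class PartitionSketch {

    // Same formula as Hadoop's default HashPartitioner:
    // clear the sign bit of the hash, then take the remainder.
    public static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Count how many records would be sent to each reduce task.
    // Each reduce task that receives records writes its own output file.
    public static Map<Integer, Integer> countPerPartition(String[] keys, int numReduceTasks) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String key : keys) {
            counts.merge(partitionFor(key, numReduceTasks), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        int reducers = 4; // hypothetical number of reduce tasks

        // Every record has the same key: all of them go to one reduce task.
        String[] sameKey = {"k", "k", "k", "k", "k", "k"};
        // Distinct keys: records are spread over several reduce tasks.
        String[] distinctKeys = {"a", "b", "c", "d", "e", "f"};

        System.out.println("same key:      " + countPerPartition(sameKey, reducers));
        System.out.println("distinct keys: " + countPerPartition(distinctKeys, reducers));
    }
}
```

With identical keys the map has a single entry, so only one reduce task
receives any data. With distinct keys the records are spread across tasks.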

Hope this helps,

2009/1/11 Jim Twensky <jim.twensky@gmail.com>:
> Hello,
> The original map-reduce paper states: "After successful completion, the
> output of the map-reduce execution is available in the R output files (one
> per reduce task, with file names as specified by the user)." However, when
> using Hadoop's TextOutputFormat, all the reducer outputs are combined in a
> single file called part-00000. I was wondering how and when this merging
> process is done. When the reducer calls output.collect(key,value), is this
> record written to a local temporary output file in the reducer's disk and
> then these local files (a total of R) are later merged into one single file
> with a final thread or is it directly written to the final output file
> (part-00000)? I am asking this because I'd like to get an ordered sample of
> the final output data, i.e. one record per every 1000 records or something
> similar and I don't want to run a serial process that iterates on the final
> output file.
> Thanks,
> Jim

M. Raşit ÖZDAŞ