hadoop-common-user mailing list archives

From Owen O'Malley <omal...@apache.org>
Subject Re: Merging reducer outputs into a single part-00000 file
Date Wed, 14 Jan 2009 17:23:08 GMT
On Jan 14, 2009, at 12:46 AM, Rasit OZDAS wrote:

> Jim,
> As far as I know, there is no operation done after Reducer.

Correct, other than output promotion, which moves the output file to  
the final filename.

> But if you are a little experienced, you already know this.
> Ordered list means one final file, or am I missing something?

Creating a single output file adds a lot of cost and no value. The real
question is how you want the keys divided between the reduces (and
therefore the output files). The default partitioner hashes the key and
takes the result modulo the number of reduces, which "stripes" the keys
across the output files. You can use mapred.lib.InputSampler to generate
good partition keys and mapred.lib.TotalOrderPartitioner to get
completely sorted output based on those partition keys. With the total
order partitioner, each reduce gets the next higher range of keys, so
concatenating the output files in order gives fully sorted output: all
of the nice properties of a single file without the costs.
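
To make the difference between the two partitioning schemes concrete, here
is a minimal standalone sketch in plain Java. It does not use the Hadoop
classes; PartitionSketch, hashPartition, rangePartition and the split
points "c" and "m" are all hypothetical names and values chosen for
illustration. In a real job the split points would come from sampling the
input rather than being hard-coded.

```java
import java.util.Arrays;

// Standalone illustration only; none of these names come from Hadoop.
final class PartitionSketch {

    // Hash-style partitioning: take a non-negative hash of the key,
    // then the remainder modulo the number of reduces.
    static int hashPartition(String key, int numReduces) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduces;
    }

    // Range partitioning: splitPoints must be sorted and contain
    // (number of reduces - 1) keys. Keys below splitPoints[0] go to
    // reduce 0; keys in [splitPoints[i-1], splitPoints[i]) go to reduce i.
    static int rangePartition(String key, String[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is absent.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        String[] keys = {"apple", "banana", "cherry", "grape", "mango", "peach", "zebra"};
        String[] splitPoints = {"c", "m"}; // assumed split points for 3 reduces
        for (String k : keys) {
            System.out.println(k
                + " hash=" + hashPartition(k, 3)
                + " range=" + rangePartition(k, splitPoints));
        }
    }
}
```

With range partitioning the reduce number never decreases as the keys
increase, so part-00000, part-00001, ... read in order form one sorted
sequence. With hash partitioning, neighbouring keys can land in any file.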

-- Owen
