hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: Partitioning reduce output by date
Date Wed, 19 Mar 2008 00:51:11 GMT

I think that a custom partitioner is half of the answer.  The other half is
that the reducer can open and close output files as needed.  The partitioner
sends every record for a given date to the same reducer, and reduce input
arrives sorted by key, so only one file need be kept open at a time.  It is
good practice to open the files relative to the task directory so that the
output of a failed task attempt is discarded rather than left behind.

These files are called task side effect files and are documented here:

http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Task+Side-Effect+Files
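
To make the two halves concrete, here is a rough sketch in plain Java.  It
deliberately does not use the Hadoop classes, so treat it as an illustration
of the logic rather than as the actual Partitioner or Reducer API.  The names
DateOutputSketch, partitionFor and RollingDateWriter are made up for this
example, and taskDir stands in for whatever task-local output directory your
job uses.  It assumes the date string is the key and that keys reach the
writer in sorted order, as they do in a reducer.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical standalone sketch; not part of Hadoop.
final class DateOutputSketch {

    // Partitioner half: hash the date key and take it modulo the number of
    // reducers, so every record for one date lands on the same reducer.
    // Masking with Integer.MAX_VALUE keeps the result non-negative.
    static int partitionFor(String dateKey, int numReduceTasks) {
        return (dateKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Reducer half: write each date's records to <taskDir>/<date>.txt,
    // holding only one file open at a time.
    static final class RollingDateWriter implements AutoCloseable {
        private final Path taskDir;   // assumed task-local directory
        private String currentDate;   // date of the file currently open
        private BufferedWriter out;   // null when no file is open

        RollingDateWriter(Path taskDir) {
            this.taskDir = taskDir;
        }

        // Assumes all records for a date arrive together (sorted keys).
        // A date that reappears later would truncate its earlier file.
        void write(String dateKey, String record) throws IOException {
            if (!dateKey.equals(currentDate)) {
                close(); // finish the previous date's file, if any
                out = Files.newBufferedWriter(
                        taskDir.resolve(dateKey + ".txt"),
                        StandardCharsets.UTF_8);
                currentDate = dateKey;
            }
            out.write(record);
            out.newLine();
        }

        @Override
        public void close() throws IOException {
            if (out != null) {
                out.close();
                out = null;
                currentDate = null;
            }
        }
    }
}
```

In a real job the same two pieces would live in your partitioner's
getPartition method and in the reducer, with the writer closed when the
reducer is closed.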


On 3/18/08 5:17 PM, "Arun C Murthy" <arunc@yahoo-inc.com> wrote:

>> I have not a single part-xxxxx file but, say, 2008-03-01.txt,
>> 2008-03-02.txt, and so on, one file for each distinct date.
>> 
> 
> You want a custom partitioner...
> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Partitioner

