hadoop-common-user mailing list archives

From Arun C Murthy <ar...@yahoo-inc.com>
Subject Re: Partitioning reduce output by date
Date Wed, 19 Mar 2008 00:17:32 GMT

On Mar 18, 2008, at 4:35 PM, Otis Gospodnetic wrote:

> Hi,
>
> What is the best/right way to handle partitioning of the final job
> output (i.e. output of reduce tasks)?  In my case, I am processing
> logs whose entries include dates (e.g. "2008-03-01    foo    bar    baz").
> A single log file may contain a number of different dates,
> and I'd like to group reduce output by date so that, in the end, I
> have not a single part-xxxxx file but, say, 2008-03-01.txt,
> 2008-03-02.txt, and so on, one file for each distinct date.
>

You want a custom partitioner; it decides which reduce task each key
is sent to:
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Partitioner
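
A rough sketch of the partitioning logic, in plain Java so it runs
without Hadoop on the classpath. The class and method names
(DatePartitionSketch, extractDate, partitionFor) are made up for
illustration, and it assumes each key starts with an ISO date
(yyyy-MM-dd) followed by whitespace. In a real job the body of
partitionFor would presumably sit in getPartition(key, value,
numPartitions) of a class implementing
org.apache.hadoop.mapred.Partitioner.

```java
import java.time.LocalDate;

// Hypothetical helper, not part of Hadoop: shows one way to map a
// date-prefixed key to a reduce partition.
class DatePartitionSketch {

    // Assumes the key begins with the date, e.g. "2008-03-01    foo    bar".
    static String extractDate(String key) {
        return key.trim().split("\\s+", 2)[0];
    }

    // Every key with the same date gets the same partition number.
    // Using the day count since 1970-01-01 means consecutive dates land
    // in consecutive partitions, so N consecutive dates do not collide
    // as long as numPartitions >= N.
    static int partitionFor(String key, int numPartitions) {
        long epochDay = LocalDate.parse(extractDate(key)).toEpochDay();
        return (int) Math.floorMod(epochDay, (long) numPartitions);
    }

    public static void main(String[] args) {
        String[] keys = {
            "2008-03-01\tfoo",
            "2008-03-01\tbar",
            "2008-03-02\tbaz"
        };
        for (String key : keys) {
            System.out.println(key + " -> partition " + partitionFor(key, 4));
        }
    }
}
```

Two caveats with this approach: if there are more distinct dates than
reduce tasks, some dates will share a reducer, and partitioning by
itself only controls which reducer sees a date. The output files are
still named part-NNNNN, not 2008-03-01.txt.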

Arun

> If it helps, the keys in my job include the dates from the input  
> logs, so I could parse the dates out of the keys in the reduce  
> phase, if that's the thing to do.
>
> I'm looking at OutputFormat and RecordWriter, but I'm not sure if  
> that's the direction I should pursue.
>
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>

