hadoop-mapreduce-user mailing list archives

From Alan Miller <someb...@squareplanet.de>
Subject Re: tab-delimited output
Date Thu, 13 May 2010 21:17:12 GMT
Thanks Alex,

For question 2, I was able to implement a Custom OutputFormat that
allows me to write some header lines to a file then write multiple
tab-delimited values per line like I wanted.

I had to "extend FileOutputFormat" and implement my own
write(), close() and getRecordWriter().
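
The write()/close() part can be sketched without any Hadoop types. This is only a minimal illustration, assuming the header lines and column values are plain strings; the class name TabDelimitedWriter and its methods are hypothetical stand-ins for what a custom RecordWriter might delegate to, not the code actually used in this job.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: writes the header lines once, then one tab-delimited line per record.
class TabDelimitedWriter {
    private final Writer out;

    TabDelimitedWriter(Writer out, List<String> headerLines) {
        this.out = out;
        try {
            for (String header : headerLines) {
                out.write(header);
                out.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Emits "key<TAB>v1<TAB>v2..." followed by a newline.
    void write(String key, List<String> values) {
        try {
            out.write(key);
            for (String value : values) {
                out.write('\t');
                out.write(value);
            }
            out.write('\n');
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void close() {
        try {
            out.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringWriter sink = new StringWriter();
        TabDelimitedWriter w = new TabDelimitedWriter(sink, Arrays.asList("# daily averages"));
        w.write("hostX", Arrays.asList("1.5", "2.5", "3.5"));
        w.close();
        System.out.print(sink);
    }
}
```

In a real OutputFormat the Writer would wrap the stream opened in getRecordWriter() instead of a StringWriter.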

The 1st question is still open for me though. How to separate reducer
outputs based on a substring of the reducer's key.
In my Driver class I now use my custom OutputFormat (MyOutputFormat.class),
so I can't use MultipleOutput.class to dissect the outputs.

Is there a way to make my MyOutputFormat.class work like MultipleOutput?
The getRecordWriter calls job.getConfiguration(), so could I do something
like this:
   set a new filename in my reduce() via conf.set("fileprefix", ...)
   read the new filename in getRecordWriter() via conf.get("fileprefix")?
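
A possible alternative to passing the filename through the configuration, sketched here in plain Java: as far as I know getRecordWriter() is called once per task, before reduce() sees any keys, so a value set later in reduce() would probably not reach it. The writer itself could instead pick the output from the key. The class DateRoutingWriter and the opener factory below are hypothetical names, and the sketch assumes keys of the form host_YYYY-MM-DD as in the original question.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical sketch: keeps one open writer per date and routes each
// record by the part of the key after the last '_'.
class DateRoutingWriter {
    private final Map<String, Writer> writers = new TreeMap<>();
    private final Function<String, Writer> opener; // assumed factory: date -> opened output

    DateRoutingWriter(Function<String, Writer> opener) {
        this.opener = opener;
    }

    // Splits "hostX_2010-05-01" into host and date, then writes
    // "host<TAB>values" to the writer belonging to that date.
    void write(String key, String tabbedValues) {
        int sep = key.lastIndexOf('_');
        if (sep < 0) {
            throw new IllegalArgumentException("key has no '_' separator: " + key);
        }
        String host = key.substring(0, sep);
        String date = key.substring(sep + 1);
        Writer w = writers.computeIfAbsent(date, opener);
        try {
            w.write(host + "\t" + tabbedValues + "\n");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Closes every per-date writer that was opened.
    void close() {
        try {
            for (Writer w : writers.values()) {
                w.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, StringWriter> files = new HashMap<>();
        DateRoutingWriter router = new DateRoutingWriter(date -> {
            StringWriter s = new StringWriter();
            files.put(date, s); // stands in for opening a file named after the date
            return s;
        });
        router.write("hostX_2010-05-01", "1.0\t2.0");
        router.write("hostY_2010-05-02", "3.0\t4.0");
        router.close();
        files.forEach((date, content) -> System.out.print(date + ":\n" + content));
    }
}
```

With several reducers, each per-date file name would still need a per-task suffix so that tasks do not overwrite each other.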


On 05/13/2010 12:29 AM, Alex Kozlov wrote:
> Hi Alan,
> Unless you run your job with a single reducer you will not be able to 
> do this.  Think scalable: you should always add '-r-NNNNN' to the end 
> to allow for multiple reducers and you can use custom partitioner to 
> make sure each host goes to a single reducer.  MultipleOutputs can do 
> the rest, meaning the 'YYYY-MM-DD' prefix.  Question 2 looks like a simple
> aggregation job: the key should be the host name, and you need just to 
> aggregate the values for each host x YYYY-MM-DD pair and write them 
> into separate 'YYYY-MM-DD-r-NNNNN' files.  You can also do secondary 
> sort to make sure the YYYY-MM-DD values come in order: this way you do 
> not need to aggregate them in memory.  See Reducer.java 
> <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Reducer.html>
> for details.
> Alex K
> On Wed, May 12, 2010 at 3:04 PM, Alan Miller <somebody@squareplanet.de 
> <mailto:somebody@squareplanet.de>> wrote:
>     Hi Alex,
>     The tab isn't the issue (yet). I guess it's really 2 questions I have.
>     Using the reducer inputs already mentioned.
>     1. How do I generate multiple output files named YYYY-MM-DD.txt
>     2. Each file should contain
>          a. one line per host
>          b. each line with host avg1 avg2 avg3 ....
>     Alan
>     On 05/12/2010 11:50 PM, Alex Kozlov wrote:
>>     Hi Alan,
>>     Is the problem that you want your 'value' vals to be tab
>>     separated?   This is entirely under control of your reducer.
>>     Alex K
>>     On Wed, May 12, 2010 at 2:07 PM, Alan Miller
>>     <somebody@squareplanet.de <mailto:somebody@squareplanet.de>> wrote:
>>         Hi all,
>>         How can I write tab-delimited output files from my reducer?
>>         My reducer gets Text/Text key/vals like:
>>         hostX_2010-05-01 varA=valA1,varB=valB1,varC=valC1
>>         hostX_2010-05-01 varA=valA2,varB=valB2,varC=valC2
>>         hostX_2010-05-01 varA=valA3,varB=valB3,varC=valC3
>>         ...
>>         hostY_2010-05-01 varA=valA1,varB=valB1,varC=valC1
>>         hostY_2010-05-01 varA=valA2,varB=valB2,varC=valC2
>>         hostY_2010-05-01 varA=valA3,varB=valB3,varC=valC3
>>         ...
>>         After my reducer calcs the daily averages of varA,B,C
>>         I  want to write a tab-delimited file with lines like:
>>         hostX    varA-Avg    varB-Avg    varC-Avg    ....
>>         hostY    varA-Avg    varB-Avg    varC-Avg    ....
>>         Thanks,
>>         Alan
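
The averaging step described in the original question can be sketched in plain Java. This is only an illustration under assumptions: the values are taken to be well-formed, comma-separated name=number pairs, and the class DailyAverages, the method averageLine, and the two-decimal formatting are hypothetical choices rather than anything stated in the thread.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of what a reducer could do with all values for one key.
class DailyAverages {
    // Averages each variable over the given values and returns
    // "host<TAB>avgA<TAB>avgB..." in first-seen variable order.
    // Assumes every value looks like "varA=1.0,varB=2.0".
    static String averageLine(String host, List<String> values) {
        Map<String, double[]> acc = new LinkedHashMap<>(); // name -> {sum, count}
        for (String value : values) {
            for (String pair : value.split(",")) {
                String[] kv = pair.split("=", 2);
                double[] sumCount = acc.computeIfAbsent(kv[0], k -> new double[2]);
                sumCount[0] += Double.parseDouble(kv[1]);
                sumCount[1] += 1;
            }
        }
        StringBuilder line = new StringBuilder(host);
        for (double[] sumCount : acc.values()) {
            double avg = sumCount[0] / sumCount[1];
            line.append('\t').append(String.format(Locale.ROOT, "%.2f", avg));
        }
        return line.toString();
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("varA=1,varB=10", "varA=3,varB=20");
        System.out.println(averageLine("hostX", values));
    }
}
```

The host name passed in would be the part of the reducer key before the '_' separator.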
