hadoop-mapreduce-user mailing list archives

From "Some Body" <someb...@squareplanet.de>
Subject Re: MultipleOutputs or Partitioner
Date Mon, 10 May 2010 17:45:04 GMT
Thanks a lot. I was able to use MultipleOutputs to get closer to what I want.
With the changes listed below, I can now generate multiple output
files like:

	/test/out/2010-04-19_morning.txt-r-00000
	/test/out/2010-04-19_afternoon.txt-r-00000
	/test/out/2010-04-20_morning.txt-r-00000
	/test/out/2010-04-20_afternoon.txt-r-00000

Each file has <host>_<varname> <value> pairs, one per line, like:
	host1_cpu	2.0
	host2_cpu	4.0
	host1_mem	4000.0
	host2_mem	8000.0

But my real goal is to have multi-column output files like:
	host1	2.0	4000.0
	host2	4.0	8000.0

Is this possible via a Reducer, or do I need some post-processing?
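
To make the reshaping in question concrete, here is a minimal plain-Java sketch of the pivot itself, with no Hadoop dependency. It is not a confirmed answer from the thread. The class and method names are hypothetical, and it assumes that variable names contain no "_" and that every host reports the same set of variables. Inside a job, the same idea would presumably require that all variables for one host and time slot reach a single reduce() call, which is an assumption about how the keys would be restructured.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of pivoting "<host>_<varname>\t<value>" records
// into one row per host. All names here are hypothetical.
class PivotSketch {

    // Columns are ordered alphabetically by variable name (cpu before mem).
    static List<String> pivot(List<String> lines) {
        // host -> (varname -> value); TreeMap keeps hosts and variables sorted.
        Map<String, Map<String, String>> byHost =
                new TreeMap<String, Map<String, String>>();
        for (String line : lines) {
            String[] kv = line.split("\t");
            int sep = kv[0].lastIndexOf('_');   // assumes varname has no "_"
            String host = kv[0].substring(0, sep);
            String var = kv[0].substring(sep + 1);
            Map<String, String> vars = byHost.get(host);
            if (vars == null) {
                vars = new TreeMap<String, String>();
                byHost.put(host, vars);
            }
            vars.put(var, kv[1].trim());
        }
        // Emit one tab-separated row per host: host, then each value.
        List<String> rows = new ArrayList<String>();
        for (Map.Entry<String, Map<String, String>> e : byHost.entrySet()) {
            StringBuilder row = new StringBuilder(e.getKey());
            for (String value : e.getValue().values()) {
                row.append('\t').append(value);
            }
            rows.add(row.toString());
        }
        return rows;
    }
}
```

If a host is missing a variable, its row simply has fewer columns, so real data would need a placeholder for absent values.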

Relevant Changes:

Added this to my Driver class:
  MultipleOutputs.addNamedOutput(job, "txt", TextOutputFormat.class, Text.class, FloatWritable.class);

Added this to my Reducer class:
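	...
	private MultipleOutputs<Text, FloatWritable> mos;

	@Override
	public void setup(Context context) { mos = new MultipleOutputs<Text, FloatWritable>(context); }

	// Close MultipleOutputs so that all named outputs are flushed.
	@Override
	public void cleanup(Context context) throws IOException, InterruptedException {
		mos.close();
	}

	// Joins two consecutive "_"-separated fields of the key, starting at index start.
	public static String getSubKey(String text, int start) {
		String[] parts = text.split("_");
		return parts[start] + "_" + parts[start + 1];
	}

	@Override
	public void reduce(Text key, Iterable<FloatWritable> values, Context context)
			throws IOException, InterruptedException {
	....
		String k = key.toString();	// getSubKey expects a String, not a Text
		mos.write("txt", new Text(getSubKey(k, 0)),
			new FloatWritable(average.floatValue()), getSubKey(k, 2) + ".txt");
	}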
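
As a quick check of how the key gets split, the helper can be run on its own outside Hadoop. This is a small sketch: the class name SubKeyDemo is hypothetical, and the sample key follows the format from the original question quoted below.

```java
// Standalone check of how getSubKey splits a reducer key; no Hadoop classes needed.
class SubKeyDemo {

    // Same logic as the getSubKey helper in the reducer.
    static String getSubKey(String text, int start) {
        String[] parts = text.split("_");
        return parts[start] + "_" + parts[start + 1];
    }

    public static void main(String[] args) {
        String key = "hostA_VarX_2010-05-01_morning";
        System.out.println(getSubKey(key, 0));           // used as the record key
        System.out.println(getSubKey(key, 2) + ".txt");  // used as the base output path
    }
}
```

For this key, fields 0 and 1 give hostA_VarX and fields 2 and 3 give 2010-05-01_morning. The split relies on host and variable names never containing "_"; a host such as web_01 would shift every field by one.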


Alan

----- original message --------

Subject: Re: MultipleOutputs or Partitioner
Sent: Mon, 10 May 2010
From: Sonal Goyal

Hi Alan,

You can use MultipleOutputFormat and override its generateFileName... methods to get the
functionality you want.

A partitioner controls how data moves from the mapper to the reducer, so if you take that
approach, you will have to specify the number of reducers as the number of files you want,
which is not the best option if some days have more data than others. You also don't have
control over the file name. See Tom White's "Hadoop: The Definitive Guide" for an excellent
example and usage.
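
To illustrate the partitioner trade-off, here is a plain-Java sketch without Hadoop classes. partitionFor uses the same formula as Hadoop's default HashPartitioner, but applied to a java.lang.String, whose hash differs from that of Text. partitionForGroup is a hypothetical custom rule that looks only at the <date>_<time> part of the key.

```java
// Sketch of partition assignment; class and method names are hypothetical.
class PartitionSketch {

    // HashPartitioner's formula: mask off the sign bit, then take the modulus.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Hypothetical custom rule: route by the <date>_<time> fields only, so
    // every host and variable of one day/period lands on the same reducer.
    static int partitionForGroup(String fullKey, int numReduceTasks) {
        String[] parts = fullKey.split("_");
        String group = parts[2] + "_" + parts[3];
        return partitionFor(group, numReduceTasks);
    }

    public static void main(String[] args) {
        int reducers = 4;
        System.out.println(partitionForGroup("hostA_VarX_2010-05-01_morning", reducers));
        System.out.println(partitionForGroup("hostB_VarY_2010-05-01_morning", reducers));
    }
}
```

Both keys go to the same partition because they share a group. Two different groups can still hash to the same partition unless the number of reducers is tied to the number of groups, and each reducer's file keeps the default part-r-NNNNN name.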
  
Thanks and Regards,
Sonal
www.meghsoft.com


On Mon, May 10, 2010 at 5:38 PM, Some Body <somebody@squareplanet.de> wrote:
Hi,

I'm trying to understand how to generate multiple outputs in my reducer (using 0.20.2+228).
Do I need MultipleOutputs, or should I partition my output in the mapper?

My reducer currently gets key/value input pairs like the following, all of which end up in
my part-r-00000 file.

   hostA_VarX_2010-05-01_morning    <FLOATVAL>
   hostA_VarY_2010-05-01_morning    <FLOATVAL>
   hostA_VarX_2010-05-01_afternoon    <FLOATVAL>
   hostA_VarY_2010-05-01_afternoon    <FLOATVAL>
   .....
   hostB_VarX_2010-05-01_morning    <FLOATVAL>
   hostB_VarY_2010-05-01_morning    <FLOATVAL>
   hostB_VarX_2010-05-01_afternoon    <FLOATVAL>
   hostB_VarY_2010-05-01_afternoon    <FLOATVAL>
   .....
   hostA_VarX_2010-05-02_morning    <FLOATVAL>
   hostA_VarY_2010-05-02_morning    <FLOATVAL>
   hostA_VarX_2010-05-02_afternoon    <FLOATVAL>
   hostA_VarY_2010-05-02_afternoon    <FLOATVAL>
   .....
   hostB_VarX_2010-05-02_morning    <FLOATVAL>
   hostB_VarY_2010-05-02_morning    <FLOATVAL>
   hostB_VarX_2010-05-02_afternoon    <FLOATVAL>
   hostB_VarY_2010-05-02_afternoon    <FLOATVAL>
   .....

But instead of one output file, I want one output file per day/group, e.g.:
   2010-05-01_morning.txt
   2010-05-01_afternoon.txt

Each <date>_<time>.txt file would contain all keys/values for all hosts and VarNames.

Thanks,
Alan


--- original message end ----

