hadoop-common-user mailing list archives

From Asif Jan <Asif....@unige.ch>
Subject Re: how to do a reduce-only job
Date Fri, 16 Jul 2010 07:41:11 GMT
You need to join these files into one; you could either do a map-side
join or a reduce-side join.

For a map-side join (slightly more involved), look at the example:

org.apache.hadoop.examples.Join

For a reduce-side join, simply create two mappers (one for each file) and
one reducer, as long as both mappers emit the same key and value types.
You will have to use the MultipleInputs class to do so.

e.g.
MultipleInputs.addInputPath(conf, path1, InputFormat1.class, Mapper1.class);
MultipleInputs.addInputPath(conf, path2, InputFormat2.class, Mapper2.class);

The javadoc of the class explains it further.
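
To make the data flow of the reduce-side approach concrete, here is a minimal plain-Java sketch that simulates it in memory. It does not use the Hadoop API: the class name ReduceSideMergeSketch and the methods mapLine and merge are hypothetical names chosen for illustration. In a real job, the body of mapLine would roughly correspond to what each mapper does per input line, and the summing loop to what the single reducer does per key.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory simulation of a reduce-side merge of "key<TAB>count" files.
// Hypothetical illustration only; no Hadoop classes are involved.
class ReduceSideMergeSketch {

    // "Map" step: split one "key<TAB>count" line into a (key, count) pair.
    static Map.Entry<String, Long> mapLine(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("expected key<TAB>count: " + line);
        }
        String key = line.substring(0, tab);
        long count = Long.parseLong(line.substring(tab + 1).trim());
        return new AbstractMap.SimpleImmutableEntry<>(key, count);
    }

    // "Shuffle" and "reduce" steps: group counts by key, then sum each group.
    static Map<String, Long> merge(List<List<String>> inputs) {
        // TreeMap keeps keys sorted, similar to the sorted order a reducer sees.
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (List<String> input : inputs) {
            for (String line : input) {
                Map.Entry<String, Long> kv = mapLine(line);
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, List<Long>> e : grouped.entrySet()) {
            long sum = 0;
            for (long v : e.getValue()) {
                sum += v;
            }
            totals.put(e.getKey(), sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Two small stand-ins for the previously created output files.
        List<String> first = Arrays.asList("apple\t3", "pear\t1");
        List<String> second = Arrays.asList("apple\t2", "plum\t5");
        merge(Arrays.asList(first, second))
            .forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```

Because both inputs here share the same key[tab]value layout, the same parsing logic serves for both; separate mappers only become necessary when the two files have different formats.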

cheers





On Jul 15, 2010, at 10:26 PM, David Hawthorne wrote:

> I have two previously created output files of format:
>
> key[tab]value
>
> where key is text, value is an integer sum of how many times the key  
> appeared.
>
> I would like to reduce these output files together into one new  
> output file.  I'm having problems finding out how to do this.
>
> I've found ways to specify a job with no reducers, but it doesn't  
> look like there's a way to specify a reduce-only job, aside from  
> using the streaming interface with 'cat' as the mapper.  I'm not  
> opposed to this, but I also couldn't find a way to specify 'cat' as  
> a mapper and the reducer in my java class as the reducer.  I'm also  
> not sure this would work, as the reducer might simply see the entire  
> line emitted by cat as the key.  I could use awk as the reducer, but  
> I've heard that streaming is less performant than java, and I've  
> already got the java class written. I could write another java class  
> with a mapper that splits the line on tab and emits the two  
> fields as <key, value>, but that seems like it would be extra work  
> and less optimal than being able to run a reduce-only job.
>
> So... what are the options?  Is there a way to specify a reduce-only  
> job?










