hadoop-common-user mailing list archives

From Joydeep Sen Sarma <jssa...@facebook.com>
Subject RE: how use only a reducer without a mapper
Date Wed, 27 Aug 2008 21:30:30 GMT
It would be useful to have a no-sort option in the map stage (ideally on a
per-file basis - perhaps selected with a regex).

With data sets that are already sorted, the re-sorting is often
unnecessary. One can also have operations that deal with a mix of sorted
and unsorted data (a merge of a sorted table with new unsorted entries
would be a good example - only the new entries need to be sorted before
they are merged with the previously sorted data).
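
To make the mixed case concrete, here is a minimal plain-Java sketch (no
Hadoop involved) of sorting only the new entries and then merging them
with the already-sorted table. The class and method names are made up
for illustration, and integer keys are an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only: not part of Hadoop or of any job in this thread.
class SortedMergeSketch {

    // Standard two-way merge of two lists that are each already sorted.
    static List<Integer> mergeSorted(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>(a.size() + b.size());
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a.get(i) <= b.get(j)) {
                out.add(a.get(i++));
            } else {
                out.add(b.get(j++));
            }
        }
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }

    // Sort only the new entries; the existing table is trusted to be sorted
    // and is never re-sorted.
    static List<Integer> mergeWithUnsorted(List<Integer> sortedTable,
                                           List<Integer> newEntries) {
        List<Integer> fresh = new ArrayList<>(newEntries);
        Collections.sort(fresh);
        return mergeSorted(sortedTable, fresh);
    }

    public static void main(String[] args) {
        List<Integer> table = Arrays.asList(1, 4, 7, 9);   // previously sorted data
        List<Integer> fresh = Arrays.asList(8, 2, 5);      // new unsorted entries
        System.out.println(mergeWithUnsorted(table, fresh));
    }
}
```

The sort cost here depends only on the number of new entries, which is
the saving a per-file no-sort option would be after.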

That said, I am not sure of the typical cost of the map-side sort
relative to the overall job.

-----Original Message-----
From: Jason Venner [mailto:jason@attributor.com] 
Sent: Wednesday, August 27, 2008 9:28 AM
To: core-user@hadoop.apache.org
Subject: Re: how use only a reducer without a mapper

The downside of this (which appears to be the only way) is that your
entire input data set has to pass through the identity mapper and then
go through shuffle and sort before it gets to the reducer.
If you have a large input data set, this takes real resources: CPU,
disk, network and wall clock time.

What we have been doing is making map files of our data sets and
running the Join code on them; then we have reduce-equivalent capability
in the mapper.

Richard Tomsett wrote:
> Leandro Alvim wrote:
>> How can I use only a reduce without a map?
> I don't know if there's a way to run just a reduce task without a map 
> stage, but you could do it by having a map stage just using the 
> IdentityMapper class (which passes the data through to the reducers 
> unchanged), so effectively just doing a reduce.
Jason Venner
Attributor - Program the Web <http://www.attributor.com/>
Attributor is hiring Hadoop Wranglers and coding wizards, contact if 
