hadoop-mapreduce-user mailing list archives

From "Chinni, Ravi" <rchi...@syncsort.com>
Subject RE: Why does the MR framework sorts the mapper output?
Date Tue, 27 Jul 2010 14:19:21 GMT
Thanks Alex and Ken.

 

My application does not do aggregation. It mainly does some data
cleansing and transformation, so I don't need a combiner. (Also, I don't
see why a combiner always needs sorted input; it should be optional and
user-specified.)

 

To take advantage of some optimizations, I need a partitioner. This
means most of my application logic is in the reducers, so I cannot set
the # of reducers to 0. Of course, I will be happy if there is a way to
set the # of mappers to 0.
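
For readers unfamiliar with the term: a partitioner decides which reducer
receives each intermediate key. Below is a minimal sketch in plain Java,
assuming a simple hash-modulo scheme. `PartitionSketch` and `partitionFor`
are hypothetical names used only for illustration; they are not part of
the Hadoop API.

```java
public class PartitionSketch {
    // Maps a key to one of numPartitions buckets.
    // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
    static int partitionFor(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        String[] keys = {"alpha", "beta", "gamma", "alpha"};
        for (String k : keys) {
            // Equal keys always land in the same partition, which is all a reducer needs.
            System.out.println(k + " -> reducer " + partitionFor(k, 4));
        }
    }
}
```

Note that nothing in this step depends on the keys being ordered; it only
requires that equal keys hash to the same bucket.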

 

Ravi

 

 

From: Ken Goodhope [mailto:kengoodhope@gmail.com] 
Sent: Monday, July 26, 2010 8:00 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Why does the MR framework sorts the mapper output?

 

The combiner needs sorted input.

On Mon, Jul 26, 2010 at 1:46 PM, Alex Kozlov <alexvk@cloudera.com>
wrote:

Hi Ravi,

Whether a sort is required is still a point of debate: the primary
reason for sorting is to collect the entries with the same key, but one
can also implement MapReduce with hash deduping (grouping by hash
instead of by sort order). The performance trade-offs between the two
approaches are not settled either.
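
To make the two options concrete, here is a small sketch in plain Java of
grouping records by key with a hash table versus by sorting. It is only an
illustration under simplifying assumptions (everything fits in memory,
keys are non-null strings); `GroupingSketch` and its methods are
hypothetical names, not Hadoop code.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupingSketch {
    // Small helper to build a key/value record.
    static Map.Entry<String, Integer> rec(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Hash-based grouping: a single pass with no ordering of keys.
    static Map<String, List<Integer>> groupByHash(List<Map.Entry<String, Integer>> records) {
        Map<String, List<Integer>> groups = new HashMap<>();
        for (Map.Entry<String, Integer> r : records) {
            groups.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }
        return groups;
    }

    // Sort-based grouping: after sorting, equal keys are adjacent,
    // so a group ends as soon as the key changes.
    static Map<String, List<Integer>> groupBySort(List<Map.Entry<String, Integer>> records) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(records);
        sorted.sort(Map.Entry.comparingByKey());
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        String currentKey = null;
        List<Integer> current = null;
        for (Map.Entry<String, Integer> r : sorted) {
            if (!r.getKey().equals(currentKey)) {
                currentKey = r.getKey();
                current = new ArrayList<>();
                groups.put(currentKey, current);
            }
            current.add(r.getValue());
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> records =
            Arrays.asList(rec("b", 1), rec("a", 2), rec("b", 3), rec("a", 4));
        System.out.println(groupByHash(records));
        System.out.println(groupBySort(records));
    }
}
```

Both methods produce the same groups; they differ in whether keys come out
in order and in how they behave when the data no longer fits in memory,
which is where the debate lies.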

If you don't need sorting, you can always implement map-side aggregation
though and potentially set the # of reducers to 0.  There is no
potential risk, but if you want to aggregate results across different
mappers you'll get back to the original problem.

Alex K  

 

On Mon, Jul 26, 2010 at 1:32 PM, Chinni, Ravi <rchinni@syncsort.com>
wrote:

I have an MR application that is running fine except for the
performance. Increasing the number of data nodes is not an option for me.

 

Looking at the source code of the MR framework, I noticed that the
partitioned output of each mapper is sorted (MapTask.java), and on the
reduce side partitions from various mappers are merged (ReduceTask.java)
before running the reduce step. Functionally, the reducers in my
application do not require data to be in sorted order, and getting rid of
the sort and merge steps in the framework will help my application.
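
The sort-then-merge pattern described above can be sketched in plain Java
as follows. This is a simplified illustration, not the code in
MapTask.java or ReduceTask.java: it assumes each mapper's output for one
partition is an in-memory list of string keys, and `MergeSketch` and
`mergeSortedRuns` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class MergeSketch {
    // Merges already-sorted runs (one per mapper) into one sorted list using a min-heap.
    static List<String> mergeSortedRuns(List<List<String>> runs) {
        // Each heap entry is {runIndex, position}, ordered by the key it currently points at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            Comparator.comparing((int[] e) -> runs.get(e[0]).get(e[1])));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) {
                heap.add(new int[] {i, 0});
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] e = heap.poll();
            List<String> run = runs.get(e[0]);
            merged.add(run.get(e[1]));
            // Advance within the same run if it has more keys.
            if (e[1] + 1 < run.size()) {
                heap.add(new int[] {e[0], e[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Map side: each mapper sorts its own output for this partition.
        List<String> mapper1 = new ArrayList<>(Arrays.asList("pear", "apple", "fig"));
        List<String> mapper2 = new ArrayList<>(Arrays.asList("kiwi", "apple", "plum"));
        Collections.sort(mapper1);
        Collections.sort(mapper2);
        // Reduce side: the sorted runs are merged so equal keys arrive together.
        System.out.println(mergeSortedRuns(Arrays.asList(mapper1, mapper2)));
    }
}
```

The merge only works because every run is already sorted, which is why the
map-side sort and the reduce-side merge are coupled in this design.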

 

Does anyone know why the sort and merge of intermediate data is being
done by the framework? Is there anything (MR functional concepts,
framework design, etc.) that needs the sort and merge of intermediate
data? I want to take a shot at getting rid of the sort and merge steps in
the framework and want to know of any potential risks involved.

 

Any input is appreciated.

 

Thanks,

Ravi

 

 


 

 

