hadoop-mapreduce-user mailing list archives

From Anand Srivastava <Anand.Srivast...@guavus.com>
Subject Re: Optimizing jobs where map phase generates more output than its input.
Date Tue, 21 Feb 2012 09:56:30 GMT
Hi Ajit,
	You could experiment with a higher value of "io.sort.mb" so that the combiner is more effective.
However, if your combiner is such that it does not actually 'reduce' the number of records,
this will not help. You will also have to increase the Java heap size (mapred.child.java.opts)
so that your tasks don't run out of memory.
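A minimal sketch of the corresponding configuration, assuming the Hadoop 1.x property names used in this thread; the values here (a 512 MB sort buffer and a 1 GB task heap) are illustrative, not recommendations:

```xml
<!-- mapred-site.xml, or set per job -->
<property>
  <name>io.sort.mb</name>
  <!-- larger map-side sort buffer, so the combiner sees more records per spill -->
  <value>512</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <!-- task heap must comfortably exceed io.sort.mb, since the sort
       buffer is allocated inside the task JVM -->
  <value>-Xmx1024m</value>
</property>
```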


On 21-Feb-2012, at 3:09 PM, Ajit Ratnaparkhi wrote:

> Hi,
> This is about a typical pattern of map-reduce jobs.
> There are some map-reduce jobs in which the map phase generates more records than its
> input; at the reduce phase this data shrinks a lot, and the final output of reduce
> is very small.
> E.g. each map function call, i.e. for each input record, generates approx. 100 output
> records (one output record is approx. the same size as one input record). A combiner is
> applied, the map output is shuffled and reaches the reducer, where it is reduced to a very
> small output (say less than 0.1% of the input data size to the map).
> The time taken to execute such a job (where the output of map is larger than its input)
> is considerably higher than for jobs that produce the same number or fewer map output
> records for the same input data.
> Has anybody worked on optimizing such jobs? Any configuration tuning which might help?
> -Ajit
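The pattern Ajit describes can be put in back-of-envelope terms: with a 100x map fan-out, the shuffle volume, not the final reduce output, dominates the job, and the combiner only helps to the extent it collapses records before the shuffle. A small sketch, using the thread's assumed numbers (100 output records per input record, records of roughly equal size):

```python
def shuffle_bytes(input_bytes, fanout=100, combine_ratio=1.0):
    """Estimate bytes shuffled from mappers to reducers.

    fanout        -- map output records per input record (thread assumes ~100)
    combine_ratio -- fraction of map-output records surviving the combiner
                     (1.0 means the combiner does not reduce anything)
    """
    return input_bytes * fanout * combine_ratio

one_gb = 1 << 30
# Without an effective combiner, 1 GB of input becomes 100 GB of shuffle.
no_combine = shuffle_bytes(one_gb)
# A combiner that collapses records 100:1 brings the shuffle back to ~1 GB.
with_combine = shuffle_bytes(one_gb, combine_ratio=0.01)
```

This is why a larger io.sort.mb can pay off here: the more map output buffered per spill, the more duplicate keys the combiner sees in one pass, pushing combine_ratio down.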
