hadoop-mapreduce-user mailing list archives

From Ajit Ratnaparkhi <ajit.ratnapar...@gmail.com>
Subject Re: Optimizing jobs where map phase generates more output than its input.
Date Tue, 21 Feb 2012 13:44:43 GMT
Thanks Anand.

My combiner is the same as the reducer, and it reduces the data a lot (the
result is less than 0.1% of the input data size). I tried tuning these
properties (io.sort.mb raised from 100 MB to 500 MB; the Java heap size is
1 GB); it improved performance, but not by much.
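
For reference, the two settings mentioned above could be applied in the job
configuration roughly like this (a sketch using the values and pre-YARN
property names discussed in this thread; not a recommendation of specific
values):

```xml
<!-- Sketch of the tuning discussed above (old-style property names). -->
<property>
  <name>io.sort.mb</name>
  <value>500</value> <!-- map-side sort buffer, raised from the 100 MB default -->
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value> <!-- 1 GB task heap, so the larger buffer fits -->
</property>
```

A larger io.sort.mb means fewer spills, so the combiner sees more records
per spill; the heap must grow with it or the task JVM runs out of memory.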

On Tue, Feb 21, 2012 at 3:26 PM, Anand Srivastava <
Anand.Srivastava@guavus.com> wrote:

> Hi Ajit,
>        You could experiment with a higher value of "io.sort.mb" so that
> the combiner is more effective. However, if your combiner does not really
> 'reduce' the number of records, it will not help. You will have to
> increase the Java heap size as well (mapred.child.java.opts) so that your
> tasks don't run out of memory.
> Regards,
> Anand
> On 21-Feb-2012, at 3:09 PM, Ajit Ratnaparkhi wrote:
> > Hi,
> >
> > This is about a typical pattern of map-reduce jobs:
> >
> > In some map-reduce jobs, the map phase generates more records than it
> consumes; at the reduce phase this data shrinks dramatically, and the
> final output of the reduce is very small.
> > E.g., for each input record, the map generates approx. 100 output
> records (each output record is approx. the same size as an input record).
> A combiner is applied, the map output is shuffled, and it reaches the
> reducer, where it is reduced to very small output data (say, less than
> 0.1% of the map's input data size).
> >
> > The time taken to execute this kind of job (where the map output is
> larger than its input) is considerably higher than for jobs that emit the
> same number of map output records or fewer for the same input data.
> >
> > Has anybody worked on optimizing such jobs? Is there any configuration
> tuning that might help here?
> >
> > -Ajit
> >
> >
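
The effect the thread relies on, a sum-style combiner collapsing many map
output records per key into one before the shuffle, can be sketched outside
Hadoop. This is a hypothetical illustration (the class and data below are
invented, not from the thread), assuming a word-count-style combiner whose
operation is the same as the reducer's:

```java
import java.util.*;

// Hypothetical sketch: simulates what a sum-style combiner does to the
// map output of one task before the shuffle.
public class CombinerSketch {

    // Sum values per key, as a word-count combiner/reducer would.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new HashMap<>();
        for (Map.Entry<String, Integer> rec : mapOutput) {
            combined.merge(rec.getKey(), rec.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // 100 map output records for each of 3 keys, mimicking the
        // fan-out described above (each input record -> ~100 records).
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        for (String key : new String[] {"a", "b", "c"}) {
            for (int i = 0; i < 100; i++) {
                mapOutput.add(Map.entry(key, 1));
            }
        }
        Map<String, Integer> combined = combine(mapOutput);
        System.out.println("records before combine: " + mapOutput.size()); // 300
        System.out.println("records after combine:  " + combined.size());  // 3
    }
}
```

Because the combiner's operation matches the reducer's (as in this thread),
almost all of the reduction can happen map-side, which is why giving the
combiner more records per spill via io.sort.mb helps.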
