hadoop-common-user mailing list archives

From Owen O'Malley <omal...@apache.org>
Subject Re: Job performance issue: output.collect()
Date Wed, 01 Sep 2010 16:09:56 GMT

On Sep 1, 2010, at 5:18 AM, Oded Rosen wrote:

> I would like to know what happens in the output.collect line that takes
> lots of time, in order to cut down this job's running time.
> Please keep in mind that I have a combiner, and to my understanding
> different things happen to the map output when a combiner is present.

The best presentation on the map side sort is the one that Chris  
Douglas (who did most of the implementation) did for the Bay Area HUG.


There are both slides and a video of the presentation. I'd run through  
that first.

You most likely are getting more spills than you deserve. The  
variables to look at:

io.sort.mb - should be most of the task's RAM budget
io.sort.record.percent - depends on record size
io.sort.factor - typically 25 * (# of disks / node)
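
As a rough illustration of the three rules of thumb above, here is a minimal sketch in plain Java that turns them into candidate property values. Everything except the three property names is an assumption made for the example: the SortTuning class and its suggest method are hypothetical and not part of Hadoop, the 70% figure is just one reading of "most of the task's RAM budget", and the 16 bytes of per-record accounting overhead is my understanding of the map-side sort buffer in this era of Hadoop, so check it against your version before relying on it.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Hadoop. */
final class SortTuning {
    // Assumption: bytes of bookkeeping the sort buffer keeps per record.
    static final int ACCOUNTING_BYTES_PER_RECORD = 16;
    // Assumption: share of the task heap given to the sort buffer ("most of" it).
    static final int HEAP_PERCENT_FOR_SORT = 70;

    /**
     * Returns candidate values keyed by property name.
     *
     * @param taskHeapMb     heap available to one map task, in MB
     * @param disksPerNode   number of local disks on each node
     * @param avgRecordBytes average serialized key + value size, in bytes
     */
    static Map<String, String> suggest(int taskHeapMb, int disksPerNode, int avgRecordBytes) {
        if (taskHeapMb <= 0 || disksPerNode <= 0 || avgRecordBytes <= 0) {
            throw new IllegalArgumentException("all arguments must be positive");
        }
        // io.sort.mb: most of the task's RAM budget (integer math, rounds down).
        int sortMb = taskHeapMb * HEAP_PERCENT_FOR_SORT / 100;
        // io.sort.record.percent: fraction of the buffer reserved for per-record
        // accounting, sized so accounting space and data space fill up together.
        double recordPercent = (double) ACCOUNTING_BYTES_PER_RECORD
                / (ACCOUNTING_BYTES_PER_RECORD + avgRecordBytes);
        // io.sort.factor: 25 merge streams per local disk.
        int sortFactor = 25 * disksPerNode;

        Map<String, String> props = new LinkedHashMap<>();
        props.put("io.sort.mb", Integer.toString(sortMb));
        props.put("io.sort.record.percent", String.format(Locale.ROOT, "%.3f", recordPercent));
        props.put("io.sort.factor", Integer.toString(sortFactor));
        return props;
    }
}
```

For a task with a 1000 MB heap, 4 disks per node, and records averaging 84 bytes, SortTuning.suggest(1000, 4, 84) yields io.sort.mb=700, io.sort.record.percent=0.160, and io.sort.factor=100. Each pair could then be applied to the job configuration with conf.set(name, value); the spilled-records counter for the job is the thing to watch when judging whether the change helped.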

-- Owen
