hadoop-common-user mailing list archives

From Owen O'Malley <omal...@apache.org>
Subject Re: Job performance issue: output.collect()
Date Wed, 01 Sep 2010 16:09:56 GMT

On Sep 1, 2010, at 5:18 AM, Oded Rosen wrote:

> I would like to know what happens in the output.collect line that takes
> lots of time, in order to cut down this job's running time.
> Please keep in mind that I have a combiner, and to my understanding
> different things happen to the map output when a combiner is present.

The best presentation on the map side sort is the one that Chris  
Douglas (who did most of the implementation) did for the Bay Area HUG.

http://developer.yahoo.net/blogs/hadoop/2010/01/hadoop_bay_area_january_2010_u.html

There are both slides and a video of the presentation. I'd run through  
that first.

You most likely are getting more spills than you deserve. The  
variables to look at:

io.sort.mb - should be most of the task's RAM budget
io.sort.record.percent - depends on record size
io.sort.factor - typically 25 * (number of disks per node)

-- Owen
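
As a rough illustration of the three rules of thumb above, here is a minimal Python sketch that turns them into candidate values. It is not taken from the message or from Hadoop itself. The function names are hypothetical, and the io.sort.record.percent formula assumes the 0.20-era map-side sort buffer, which is understood to spend 16 bytes of accounting data per record; check that figure against your Hadoop version before relying on it. io.sort.mb is passed in as-is, since "most of the task's RAM budget" depends on your heap settings.

```python
# Assumption: bytes of per-record accounting metadata in the 0.20-era
# map-side sort buffer. Verify against your Hadoop version.
ACCOUNTING_BYTES_PER_RECORD = 16


def sort_record_percent(avg_record_bytes):
    """Fraction of the sort buffer to reserve for record accounting so that
    the accounting space and the data space fill up at about the same time."""
    return ACCOUNTING_BYTES_PER_RECORD / (
        ACCOUNTING_BYTES_PER_RECORD + avg_record_bytes
    )


def sort_factor(disks_per_node):
    """Rule of thumb from the message: 25 * (number of disks per node)."""
    return 25 * disks_per_node


def suggest_settings(avg_record_bytes, disks_per_node, sort_mb):
    """Collect candidate values keyed by their configuration names.

    sort_mb is chosen by the caller; it is not derived here.
    """
    return {
        "io.sort.mb": sort_mb,
        "io.sort.record.percent": round(
            sort_record_percent(avg_record_bytes), 3
        ),
        "io.sort.factor": sort_factor(disks_per_node),
    }


if __name__ == "__main__":
    # Hypothetical inputs: 84-byte serialized records, 4 disks, 200 MB buffer.
    settings = suggest_settings(
        avg_record_bytes=84, disks_per_node=4, sort_mb=200
    )
    for name, value in settings.items():
        print(f"{name} = {value}")
```

The resulting values are starting points only; the spill counters in the job's task logs are the real check on whether the number of spills went down.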
