hadoop-mapreduce-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Owen O'Malley" <omal...@apache.org>
Subject Re: almost sorted map output
Date Tue, 01 Jun 2010 03:11:44 GMT
On Fri, May 28, 2010 at 10:58 PM, juber patel <juberpatel@gmail.com> wrote:
> Hello,
>
> Can Hadoop take advantage of the fact that the output of each map task
> is almost sorted?

The map output sort is efficient for mostly sorted output. The
dominant cost is the transfer costs of the shuffle and that won't be
helped by almost sorted output. There is an optimization that we have
discussed for years but not implemented where if you have lop-sided
partitions (such as largely sorted data and a total order
partitioner), you schedule the reduce close to the majority of its
data. In some applications that would be a large win.

> On a related note, Does Hadoop's Quicksort implementation give worst
> case performance on almost sorted data? Should I use Heapsort in its
> place?

The quick sort used on the map side is careful about pivots and the
resulting partitions. I wouldn't worry about replacing it.

Mime
View raw message