hadoop-common-user mailing list archives

From Ed Mazur <ma...@cs.umass.edu>
Subject Re: sort done parallel or after copy ?
Date Fri, 05 Mar 2010 15:29:48 GMT
Hi Prasen,

The data that reduce tasks receive during shuffle (copy) has already
been sorted by map tasks, so they just have to be merged.

This merge happens in parallel with the shuffle. When a reduce task's
in-memory buffer of sorted map output files reaches a certain
threshold, they are merged and written to disk. If the number of
on-disk files created through this process exceeds 2n-1 where
n=io.sort.factor (10 by default), n of these files get merged so that
there are n remaining. When the shuffle ends, anywhere from 0 to 2n-1
files can still be on disk waiting to be merged. These get merged
down to (at most) n files and a final merge goes directly into the
user reduce function.
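
To make the file counting concrete, here is a rough Python sketch of the
policy described above. It is only a toy model, not Hadoop's code: the
function names are made up, plain sorted lists stand in for on-disk files,
and picking the smallest files to merge first is my assumption. I also read
the trigger as "the count reaches 2n-1", since that is what leaves n files
after n of them are merged into one.

```python
import heapq


def merge_runs(runs):
    """K-way merge of already-sorted lists into one sorted list."""
    return list(heapq.merge(*runs))


def shuffle_with_background_merge(spills, n=10):
    """Toy model of on-disk merging while the shuffle is still running.

    spills: iterable of sorted lists, each standing in for one on-disk file.
    n: stands in for io.sort.factor.
    Returns the "files" still on disk when the shuffle ends.
    """
    on_disk = []
    for spill in spills:
        on_disk.append(spill)
        # Assumed trigger: once 2n-1 files exist, merge n of them into one,
        # which leaves n files on disk.
        if len(on_disk) >= 2 * n - 1:
            on_disk.sort(key=len)  # assumption: merge the smallest files first
            merged = merge_runs(on_disk[:n])
            on_disk = on_disk[n:] + [merged]
    return on_disk


def final_merge(on_disk, n=10):
    """Merge down to at most n files, then stream one last merge."""
    while len(on_disk) > n:
        on_disk.sort(key=len)
        extra = len(on_disk) - n + 1  # merging this many leaves exactly n
        merged = merge_runs(on_disk[:extra])
        on_disk = on_disk[extra:] + [merged]
    # Lazy iterator: records are handed to the consumer as they are merged,
    # the way the final merge feeds the reduce function without another
    # intermediate file.
    return heapq.merge(*on_disk)
```

Feeding it 25 small sorted lists with n=3, for example, never leaves more
than 4 files on disk at the end of the shuffle, and iterating over
final_merge(...) yields every record in sorted order.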


On Fri, Mar 5, 2010 at 12:36 AM, prasenjit mukherjee
<prasen.bea@gmail.com> wrote:
> if I understand correctly reduce has 3 stages : copy,sort,reduce. Copy
> happens  parallely with mappers  still running. Reduce has to wait
> till all the mappers are done.
> For sorting we could have 2 options :
> 1) Entire sorting happens after copy ( in a single shot ) OR
> 2) It could happen along with copy where each block is sorted and
> later merged ( via merge-sort  )
> How is it being currently done in hadoop's latest version ?
> -Thanks,
> Prasen
