hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dan Fundatureanu <danimus.g.pr...@gmail.com>
Subject Re: Algorithm used "Shuffle and Sort" step
Date Thu, 29 Apr 2010 18:12:46 GMT
Thank you,
and I think this might hold also the answer for my question:

in "hadoop-default.xml" :

<property>
  <name>map.sort.class</name>
  <value>org.apache.hadoop.util.QuickSort</value>
  <description>The default sort class for sorting keys.</description>
</property>



On Wed, Apr 28, 2010 at 3:27 PM, Patrick Angeles <patrick@cloudera.com>wrote:

> Dan,
>
> Shuffle and Sort is a combination of multiple 'algorithms'.
>
> - Map output goes to a circular, in-memory buffer
> - When this starts filling up, it gets 'spilled' to disk
> - Spilling involves writing each K/V pair to a partition specific file
> (where partition is the algorithm Jim describes below) in order sorted by K
> - You may get multiple files per partition (you get a file per-partition
> every time a spill happens)
> - In which case, the spill files get merge sorted into larger files
> - Reducers pick up the final merged files from multiple mappers
> - Reducers may pick up multiple of these files from several mappers
> - The reducer may perform a final merge before starting the reduce phase
>
>
>
> On Wed, Apr 28, 2010 at 2:47 PM, Jim Twensky <jim.twensky@gmail.com>
> wrote:
>
> > I'm not quite sure what you mean by the Shuffle algorithm but briefly
> > for each (key,value) pair, a hash value is computed and then the
> > record is sent to a node by taking the modulo of that hash. So if you
> > have n reducers, the record goes to reducer # : hash(key) % n.
> >
> > The sorting algorithm is most probably a variation of the external
> > sorting algorithm used for sorting objects that don't fit in the
> > memory. See:
> >
> > http://en.wikipedia.org/wiki/External_sorting
> >
> > for more details.
> >
> > Jim
> >
> > On Wed, Apr 28, 2010 at 1:16 PM, Dan Fundatureanu
> > <danimus.g.prime@gmail.com> wrote:
> > > What is the algorithm used in "Shuffle and Sort" step?
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message