hadoop-common-user mailing list archives

From Felix.徐 <ygnhz...@gmail.com>
Subject Collect, Spill and Merge phases insight
Date Tue, 16 Jul 2013 07:52:03 GMT
Hi all,

I am trying to understand the Collect, Spill and Merge phases of a map task.
I have read some of the documentation but still have a few questions.

Here is my understanding of the spill process on the map side:

1. The collect function adds a record to the buffer.
2. If the buffer usage exceeds a threshold (determined by parameters such as
io.sort.mb), the spill phase begins.
3. The spill phase includes 3 actions: sort, combine and compression.
4. A spill may be performed multiple times, so several spill files may be
generated.
5. If there is more than one spill file, the Merge phase begins and merges
these files into a single large one.
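
The steps above can be sketched as a toy model in Java. This is only an
illustration of my understanding, not Hadoop's actual code: the class
SpillSketch, the record-count threshold (standing in for a byte threshold
derived from io.sort.mb) and the in-memory "spill files" are all hypothetical.
The sketch also assumes the partition is computed at collect time, which is
exactly what question 1 below asks about. Combine and compression are omitted.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model of collect and spill; NOT the Hadoop implementation.
final class SpillSketch {
    // One buffered record. Storing the partition here is an assumption of this sketch.
    static final class Rec {
        final int partition;
        final String key;
        final String value;
        Rec(int partition, String key, String value) {
            this.partition = partition;
            this.key = key;
            this.value = value;
        }
    }

    final int numPartitions;
    final int spillThreshold;                 // in records; a stand-in for a byte limit
    final List<Rec> buffer = new ArrayList<>();
    final List<List<Rec>> spills = new ArrayList<>(); // each inner list models one spill file

    SpillSketch(int numPartitions, int spillThreshold) {
        this.numPartitions = numPartitions;
        this.spillThreshold = spillThreshold;
    }

    // Step 1 and 2: append a record, then spill once the threshold is reached.
    void collect(String key, String value) {
        int partition = (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        buffer.add(new Rec(partition, key, value));
        if (buffer.size() >= spillThreshold) {
            spill();
        }
    }

    // Step 3 and 4: sort the buffered records by (partition, key) and emit one sorted run.
    void spill() {
        if (buffer.isEmpty()) {
            return;
        }
        List<Rec> run = new ArrayList<>(buffer);
        run.sort(Comparator.comparingInt((Rec r) -> r.partition).thenComparing(r -> r.key));
        spills.add(run);
        buffer.clear();
    }

    // Called at the end of the map input to write out whatever is still buffered.
    void flush() {
        spill();
    }
}
```

With a threshold of 3 records, collecting 5 records and then calling flush()
leaves two sorted runs in spills, which is the situation in which step 5
(the merge) would apply.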

If I have misunderstood any of these phases, please correct me, thanks!
And my questions are:

1. Where is the partition calculated (in Collect or in Spill)? Does
Collect simply append a record to the buffer and check whether the buffer
should be spilled?

2. In the Merge phase, since the spill files are compressed, do they need to
be decompressed and then compressed again? Since Merge may take more than
one round, are the intermediate files compressed as well?

3. Is the Merge phase on the Map side almost the same as on the Reduce side
(external merge sort combined with a min-heap)?
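
To be clear about what I mean by an external merge with a min-heap in
question 3, here is a minimal Java sketch. It is a generic k-way merge over
sorted runs of keys, not Hadoop's merger: the class KWayMerge and its Head
helper are hypothetical names, and the runs are in-memory lists rather than
files on disk.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Generic k-way merge with a min-heap; an illustration, NOT the Hadoop implementation.
final class KWayMerge {
    // Heap entry: the current smallest unread key of one run, plus the rest of that run.
    static final class Head {
        final String key;
        final Iterator<String> rest;
        Head(String key, Iterator<String> rest) {
            this.key = key;
            this.rest = rest;
        }
    }

    // Merges already-sorted runs into one sorted list.
    static List<String> merge(List<List<String>> sortedRuns) {
        PriorityQueue<Head> heap = new PriorityQueue<>((a, b) -> a.key.compareTo(b.key));

        // Seed the heap with the first key of every non-empty run.
        for (List<String> run : sortedRuns) {
            Iterator<String> it = run.iterator();
            if (it.hasNext()) {
                heap.add(new Head(it.next(), it));
            }
        }

        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            // Pop the globally smallest key, then refill from the run it came from.
            Head h = heap.poll();
            out.add(h.key);
            if (h.rest.hasNext()) {
                heap.add(new Head(h.rest.next(), h.rest));
            }
        }
        return out;
    }
}
```

The heap holds at most one entry per run, so each output record costs
O(log k) comparisons for k runs.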
