hadoop-mapreduce-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bai Shen <baishen.li...@gmail.com>
Subject Re: Hadoop map reduce merge algorithm
Date Thu, 12 Jan 2012 21:42:50 GMT
That's my understanding as well.  I can't seem to find any settings that
govern the step where the output is merged into a single file.
io.sort.factor modifies the number of passes that is done, but it
eventually ends up doing the same thing no matter how many spill files
there are.  They're simply combined incrementally instead of all at once.

Is anybody more familiar with this step of the process?


On Thu, Jan 12, 2012 at 2:27 PM, Robert Evans <evans@yahoo-inc.com> wrote:

>  My understanding is that the mapper will cache the output in memory
> until its memory buffer fills up, at which point it will sort the data and
> spill it to disk.  Once a given number of spill files are created they will
> be merged together into a larger spill file.  Once the mapper finishes then
> the output is totally merged into a single file that can be served to the
> Reducer through the TaskTracker, or NodeManger under YARN.  The reducer
> does a similar thing as it merges the output form all of the mappers.  I
> don’t understand all of the reasons behind this, but I think much of it is
> to optimize the time it takes to sort the data.  If you try to merge too
> many files then you waste a lot of time doing seeks and less time reading
> data.  But I was not involved with developing it so I don’t know for sure.
> --Bobby Evans
> On 1/12/12 10:27 AM, "Bai Shen" <baishen.lists@gmail.com> wrote:
> Can someone explain how the map reduce merge is done?  As far as I can
> tell, it appears to pull all of the spill files into one giant file to send
> to the reducer.  Is this correct?  Even if you set smaller spill files and
> a lower sort factor, the eventual merge is still the same.  It just takes
> more passes to get there.
> Thanks.

View raw message