hadoop-mapreduce-user mailing list archives

From "Ravi Gummadi" <gr...@yahoo-inc.com>
Subject Re: Hadoop map reduce merge algorithm
Date Fri, 13 Jan 2012 04:14:53 GMT
Yes. The spills of map output get merged into a single file. Spills are triggered when the
sort buffer, whose size is set by the configuration property io.sort.mb, fills up. A bigger
value for io.sort.mb generally gives better performance, but the limit has to be set based on
the amount of RAM available.
Similarly, a bigger value for the configuration property io.sort.factor is generally better
for performance. Here too, a smaller value may have to be set based on the amount of RAM
available.
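
To make the trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It is
not Hadoop code: the function names are made up for illustration, the 0.80 default mirrors
io.sort.spill.percent, and the model ignores record metadata, combiners and compression, so
real spill counts will differ.

```python
import math


def estimate_spills(map_output_mb, io_sort_mb, spill_percent=0.80):
    """Rough number of spill files for one map task (hypothetical helper).

    Assumes a spill starts once the buffer is `spill_percent` full.
    """
    usable_mb = io_sort_mb * spill_percent
    return max(1, math.ceil(map_output_mb / usable_mb))


def estimate_merge_rounds(num_spills, io_sort_factor):
    """Rough number of merge rounds needed to reach a single file.

    Simplified model: every round merges groups of up to `io_sort_factor` files.
    """
    if io_sort_factor < 2:
        raise ValueError("io_sort_factor must be at least 2")
    rounds = 0
    files = num_spills
    while files > 1:
        files = math.ceil(files / io_sort_factor)
        rounds += 1
    return rounds


if __name__ == "__main__":
    for sort_mb, factor in [(100, 10), (200, 10), (100, 100)]:
        spills = estimate_spills(1000, sort_mb)
        rounds = estimate_merge_rounds(spills, factor)
        print(f"io.sort.mb={sort_mb} io.sort.factor={factor}: "
              f"{spills} spills, {rounds} merge round(s)")
```

Under these assumptions, raising either property reduces the number of merge rounds for the
same map output, at the cost of more memory per task.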

-Ravi

On 1/13/12 Friday 3:12 AM, "Bai Shen" <baishen.lists@gmail.com> wrote:

That's my understanding as well.  I can't seem to find any settings that govern the step where
the output is merged into a single file.  io.sort.factor modifies the number of passes that
are done, but it eventually ends up doing the same thing no matter how many spill files there
are.  They're simply combined incrementally instead of all at once.
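
The point about passes can be illustrated with a small Python sketch. This is only a toy
model of a multi-pass merge, not the actual Hadoop merger: `merge_in_passes` and its `factor`
argument are hypothetical names, and in-memory lists stand in for spill files.

```python
import heapq


def merge_in_passes(runs, factor):
    """Merge sorted runs, at most `factor` at a time, until one run remains.

    Returns the fully merged run and the number of passes it took.
    """
    if factor < 2:
        raise ValueError("factor must be at least 2")
    runs = [list(r) for r in runs]
    passes = 0
    while len(runs) > 1:
        # One pass: merge each group of up to `factor` runs into a bigger run.
        runs = [
            list(heapq.merge(*runs[i:i + factor]))
            for i in range(0, len(runs), factor)
        ]
        passes += 1
    return (runs[0] if runs else []), passes


if __name__ == "__main__":
    spills = [[1, 5], [2, 6], [3, 7], [4, 8]]
    print(merge_in_passes(spills, 2))
    print(merge_in_passes(spills, 4))
```

With either factor the final merged output is identical; only the number of passes changes.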

Is anybody more familiar with this step of the process?

Thanks.

On Thu, Jan 12, 2012 at 2:27 PM, Robert Evans <evans@yahoo-inc.com> wrote:
My understanding is that the mapper will cache the output in memory until its memory buffer
fills up, at which point it will sort the data and spill it to disk.  Once a given number
of spill files are created they will be merged together into a larger spill file.  Once the
mapper finishes then the output is totally merged into a single file that can be served to
the Reducer through the TaskTracker, or NodeManager under YARN.  The reducer does a similar
thing as it merges the output from all of the mappers.  I don't understand all of the reasons
behind this, but I think much of it is to optimize the time it takes to sort the data.  If
you try to merge too many files then you waste a lot of time doing seeks and less time reading
data.  But I was not involved with developing it so I don't know for sure.
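
As a rough illustration of the flow described above (buffer in memory, sort and spill when
full, merge everything at the end), here is a toy Python sketch. It is an assumption-laden
model, not the Hadoop implementation: `SpillingBuffer` is a made-up class, capacity is counted
in records rather than bytes, lists stand in for spill files, and partitioning and the
intermediate merges are left out.

```python
import heapq


class SpillingBuffer:
    """Toy model of a mapper's output buffer (hypothetical, for illustration)."""

    def __init__(self, capacity):
        self.capacity = capacity  # max records held in memory
        self.buffer = []
        self.spills = []          # each entry stands in for a sorted spill file

    def collect(self, key, value):
        """Buffer one output record, spilling when the buffer is full."""
        self.buffer.append((key, value))
        if len(self.buffer) >= self.capacity:
            self._spill()

    def _spill(self):
        # Sort the buffered records and "write" them out as one spill.
        if self.buffer:
            self.spills.append(sorted(self.buffer))
            self.buffer = []

    def close(self):
        """Spill whatever is left and merge all spills into one sorted output."""
        self._spill()
        return list(heapq.merge(*self.spills))


if __name__ == "__main__":
    buf = SpillingBuffer(capacity=3)
    for k in [5, 1, 4, 2, 3, 9, 7]:
        buf.collect(k, str(k))
    output = buf.close()
    print(len(buf.spills), [k for k, _ in output])
```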

--Bobby Evans


On 1/12/12 10:27 AM, "Bai Shen" <baishen.lists@gmail.com> wrote:

Can someone explain how the map reduce merge is done?  As far as I can tell, it appears to
pull all of the spill files into one giant file to send to the reducer.  Is this correct?
Even if you set smaller spill files and a lower sort factor, the eventual merge is still
the same.  It just takes more passes to get there.

Thanks.



