hadoop-common-user mailing list archives

From Mu Qiao <qiao...@gmail.com>
Subject Re: Why is Spilled Records always equal to Map output records
Date Wed, 15 Jul 2009 02:29:46 GMT
Thanks. But when I refer to chapter 6 of "Hadoop: The Definitive Guide", I find
that the map writes its outputs to a memory buffer (not to local disk) whose
size is controlled by io.sort.mb. Only when the buffer reaches its threshold
does it spill the outputs to local disk. If that is true, I can't see any need
for the map to store its outputs to disk when io.sort.mb is large enough.
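
To make the question concrete, here is a toy Python model of the two
descriptions combined. It is only a sketch, not Hadoop's actual code: the
function name, the fixed record size, and the 0.80 spill fraction are
assumptions chosen for illustration. The end-of-task flush is my reading of
Owen's point that the reduces pull map outputs from local disk. Merge passes
over the spill files are not modeled.

```python
def simulate_map_output(num_records, record_bytes, sort_mb, spill_percent=0.80):
    """Toy model of one map task's output path (hypothetical, simplified).

    Records accumulate in an in-memory buffer of `sort_mb` megabytes.
    When the buffered bytes reach `spill_percent` of the buffer, the
    buffered records are written ("spilled") to local disk. At the end
    of the task, whatever is still buffered is also written to disk,
    on the assumption that reduces fetch map output from disk.

    Returns (spilled_records, number_of_spills).
    """
    threshold = sort_mb * 1024 * 1024 * spill_percent
    buffered_bytes = 0
    buffered_records = 0
    spilled_records = 0
    spills = 0

    for _ in range(num_records):
        buffered_bytes += record_bytes
        buffered_records += 1
        if buffered_bytes >= threshold:
            # Buffer hit its threshold: write everything buffered so far.
            spilled_records += buffered_records
            spills += 1
            buffered_bytes = 0
            buffered_records = 0

    if buffered_records > 0:
        # End of map task: flush the remainder to disk (assumed behavior).
        spilled_records += buffered_records
        spills += 1

    return spilled_records, spills


if __name__ == "__main__":
    # Buffer far larger than the output: the threshold is never reached.
    print(simulate_map_output(1000, 100, sort_mb=100))
    # Small buffer: the threshold is reached before the task finishes.
    print(simulate_map_output(1000, 1024, sort_mb=1))
```

In this model the spilled-record count equals the number of map output
records in both cases. A larger io.sort.mb only reduces the number of spills.
If the final flush really exists, that would account for the counter values I
am seeing.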

On Wed, Jul 15, 2009 at 12:45 AM, Owen O'Malley <owen.omalley@gmail.com> wrote:

> There is no requirement that all of the reduces are running while the map is
> running. The dataflow is that the map writes its output to local disk and
> that the reduces pull the map outputs when they need them. There are threads
> handling sorting and spill of the records to disk, but that doesn't remove
> the need for the map to store its outputs to disk. (Of course, if there is
> enough ram, the operating system will have the map outputs in its file cache
> and not need to read from disk.)
> It is an interesting question as to what the changes would need to be to
> have the maps push to the reduces, but they would be substantial.
> -- Owen

Best wishes,
Qiao Mu
