hadoop-mapreduce-user mailing list archives

From Allen Wittenauer <...@apache.org>
Subject Re: How does Hadoop manage memory?
Date Thu, 30 Jun 2011 19:00:38 GMT

On Jun 28, 2011, at 1:43 PM, Peter Wolf wrote:

> Hello all,
> I am looking for the right thing to read...
> I am writing a MapReduce Speech Recognition application.  I want to run many Speech Recognizers in parallel.
> Speech Recognizers not only use a large amount of processor, they also use a large amount of memory.  Also, in my application, they are often idle much of the time waiting for data.  So optimizing what runs when is non-trivial.
> I am trying to better understand how Hadoop manages resources.  Does it automatically figure out the right number of mappers to instantiate?

	The number of mappers correlates to the number of InputSplits, which is based upon the InputFormat.  In most cases, this is equivalent to the number of blocks, so a file composed of 3 blocks will generate 3 mappers.  Again, depending upon the InputFormat, the size of these splits may be manipulated via job settings.
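A rough sketch (not Hadoop code) of the split arithmetic described above, assuming FileInputFormat-style behavior where the effective split size is clamped between a configurable minimum and maximum around the block size:

```python
def num_splits(file_size, block_size, min_split=1, max_split=float("inf")):
    """Estimate how many map tasks a single file would generate.

    Mirrors the FileInputFormat-style rule
    split_size = max(minSize, min(maxSize, blockSize)).
    """
    split_size = max(min_split, min(max_split, block_size))
    # Each full split gets a mapper; any trailing remainder gets one more.
    return file_size // split_size + (1 if file_size % split_size else 0)

# A 200 MB file with a 64 MB block size -> 4 mappers.
print(num_splits(200 * 1024**2, 64 * 1024**2))  # -> 4

# Lowering the maximum split size raises the mapper count:
print(num_splits(200 * 1024**2, 64 * 1024**2, max_split=32 * 1024**2))  # -> 7
```

The real splitting logic also considers things like file compressibility and locality, but the core "mappers track blocks, tunable via split-size settings" relationship is the one that matters here.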

> How?  What happens when other people are sharing the cluster?  What resource management is the responsibility of application developers?

	Realistically, *all* resource management is the responsibility of the operations and development teams.  The only real resource protection/allocation system that Hadoop provides is task slots and, if enabled, some memory protection in the form of "don't go over this much".  On multi-tenant systems, a good-neighbor view of the world should be adopted.
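For reference, the "don't go over this much" protection is configured per job and per cluster.  A hypothetical mapred-site.xml fragment (Hadoop 0.20/1.x-era property names; the values are illustrative, not recommendations):

```xml
<!-- Illustrative values only: what one job requests per map task -->
<property>
  <name>mapred.job.map.memory.mb</name>
  <value>512</value>
</property>
<!-- Cluster-wide ceiling a job may not exceed ("don't go over this much") -->
<property>
  <name>mapred.cluster.max.map.memory.mb</name>
  <value>2048</value>
</property>
```

Tasks whose jobs request more than the cluster maximum are rejected; enforcement beyond that still depends on the operators and developers cooperating.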

> For example, let's say each Speech Recognizer uses 500 MB, and I have 1,000,000 files to process.  What would happen if I made 1,000,000 mappers, each with 1 Speech Recognizer?

	At 1m mappers, the JobTracker would likely explode under the weight first unless its heap size was raised significantly.  Every value that you see on the JT page, including those for each task, is kept in main memory.
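A back-of-the-envelope illustration of why the JobTracker heap is the first casualty.  The per-task overhead figure below is an assumption for the sake of arithmetic, not a measured number:

```python
# Crude estimate: every task's status, counters, and bookkeeping
# live in the JobTracker heap for the life of the job.
tasks = 1_000_000
bytes_per_task = 2_000  # assumed per-task overhead; real values vary

jt_overhead_gb = tasks * bytes_per_task / 1024**3
print(round(jt_overhead_gb, 2))  # -> 1.86 (GB of heap just for task state)
```

Even at a modest per-task cost, a million tasks consumes gigabytes of JT heap before any of them has run, which is why the heap would need to be raised significantly.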

> Is it only non-optimal because of setup time, or would the system try to allocate 500GB of memory and explode?

	If you actually had 1m map slots, yes, it would allocate 0.5 TB of memory spread across the nodes.  With fewer slots, only the concurrently running tasks consume memory.
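The slot distinction is worth making concrete.  Assuming a hypothetical 100-node cluster with 8 map slots per node (both numbers invented for illustration), concurrent memory use is bounded by slots, not by the total task count:

```python
recognizer_mb = 500        # from the question
nodes = 100                # hypothetical cluster size
map_slots_per_node = 8     # hypothetical slot configuration

# Only this many map tasks run at any instant, regardless of the
# 1,000,000 tasks queued behind them.
concurrent_tasks = nodes * map_slots_per_node           # 800

peak_cluster_gb = concurrent_tasks * recognizer_mb / 1024
peak_node_gb = map_slots_per_node * recognizer_mb / 1024
print(peak_cluster_gb, peak_node_gb)  # -> 390.625 3.90625
```

So the million-mapper job would not demand 500 GB at once; the remaining tasks wait in the queue while the setup/teardown cost per task becomes the dominant inefficiency.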
