hadoop-mapreduce-user mailing list archives

From Harsh J <qwertyman...@gmail.com>
Subject Re: Question about running hadoop on multiple nodes/cores
Date Tue, 26 Oct 2010 06:04:39 GMT
On Tue, Oct 26, 2010 at 4:34 AM, Han Dong <handong32@gmail.com> wrote:
> Hi,
> In running hadoop, is it possible to specify the number of computing nodes
> to use? Or does hadoop automatically configure itself to run on different nodes?
> For example, if I specify 12 map tasks to run, and there is a cluster of 12
> computing nodes, will hadoop automatically run one map task per node, or
> would it run 2 maps per node for 6 nodes, if it detects the node has a 2
> core processor?

Hadoop MapReduce does not assign tasks by node or core count. Task
placement is data-driven instead.

If you are using HDFS, each task is run, where possible, on a node
that holds its input data (a block split, or one file among many)
locally. Otherwise a rack-local node is chosen based on its
availability (number of free slots, etc.) and the task is assigned
to it.
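To make that preference order concrete, here is a rough sketch in
Python. It is a toy model, not Hadoop's actual scheduler code: the
Node class, the pick_node function and the final "off-rack" fallback
step are names and assumptions of mine, used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical stand-in for a TaskTracker: host name, rack, free map slots.
    name: str
    rack: str
    free_slots: int


def pick_node(block_hosts, nodes):
    """Return (node, locality) for one map task, or (None, None) if no slot is free."""
    candidates = [n for n in nodes if n.free_slots > 0]
    if not candidates:
        return None, None
    # 1. Prefer a node that holds the task's input block locally.
    for n in candidates:
        if n.name in block_hosts:
            return n, "node-local"
    # 2. Otherwise prefer a node in the same rack as a node holding the block.
    replica_racks = {n.rack for n in nodes if n.name in block_hosts}
    for n in candidates:
        if n.rack in replica_racks:
            return n, "rack-local"
    # 3. Assumed fallback: any node that still has a free slot.
    return candidates[0], "off-rack"


nodes = [Node("n1", "r1", 0), Node("n2", "r1", 1), Node("n3", "r2", 2)]
node, locality = pick_node({"n1"}, nodes)
print(node.name, locality)
```

Here the block lives on n1, which has no free slot, so the sketch
falls through to n2, the free node in the same rack.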

For your question: Hadoop will *try* to use ALL the nodes
(TaskTrackers) available to it. It does not let you directly specify
which TaskTracker node a task runs on. So yes, Hadoop will
"automatically configure" the tasks to run on different nodes.

About that example: it depends on where the input data blocks of the
12 maps reside in the cluster. Only if, say, all 12 blocks sit on a
single machine, and that machine has capacity for 12 maps, could all
12 mappers be executed on that one machine.
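A small simulation of the 12-map example may help. This is only a
sketch under simplifying assumptions: the assign function, the node
names and the slot counts are made up for illustration, and tasks are
placed greedily in one pass, whereas the real scheduler hands out
tasks as TaskTrackers report free slots.

```python
from collections import Counter


def assign(block_hosts_per_task, free_slots):
    """Greedily place each map task; return a Counter of tasks per node."""
    slots = dict(free_slots)  # copy so the caller's dict is not mutated
    placed = Counter()
    for hosts in block_hosts_per_task:
        # Nodes that hold this task's block and still have a free slot.
        local = [h for h in hosts if slots.get(h, 0) > 0]
        if local:
            target = local[0]
        else:
            # No data-local slot left: use the node with the most free slots.
            target = max(slots, key=slots.get)
            if slots[target] == 0:
                break  # cluster is full; remaining tasks would have to wait
        slots[target] -= 1
        placed[target] += 1
    return placed


nodes = [f"node{i}" for i in range(1, 13)]

# Case 1: all 12 blocks on node1, and node1 has 12 map slots.
case1 = assign([["node1"]] * 12, {n: (12 if n == "node1" else 2) for n in nodes})

# Case 2: one block per node, 2 map slots on every node.
case2 = assign([[n] for n in nodes], {n: 2 for n in nodes})

# Case 3: all 12 blocks on node1, but node1 has only 2 map slots.
case3 = assign([["node1"]] * 12, {n: 2 for n in nodes})

for label, result in (("case1", case1), ("case2", case2), ("case3", case3)):
    print(label, dict(result))
```

In this model all 12 maps land on node1 only in case 1. In case 2
each node runs the map for its own block, and in case 3 the maps that
do not fit on node1 spill over to other nodes.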

[Data locality: Hadoop is pretty good at it. Its approach is to bring
the computation to the data, not the other way round.]

-- 
Harsh J
www.harshj.com
