hadoop-common-user mailing list archives

From Dennis Kubes <nutch-...@dragonflymc.com>
Subject Re: Multiple tasktrackers per node
Date Thu, 25 May 2006 15:25:10 GMT
There are a few parameters that would need to be set.
mapred.tasktracker.tasks.maximum specifies the maximum number of tasks
per tasktracker.  mapred.map.tasks sets the default number of map tasks
per job.  Usually this is set to a multiple of the number of processors
you have.  So if you have 5 nodes, each with 4 cores, you could set
mapred.map.tasks to something like 100 (5 nodes * 4 cores = 20 cores,
times 5 tasks per core = 100), where each processor would run 5 tasks
simultaneously.  mapred.tasktracker.tasks.maximum would then be set to,
say, 25 (more than the 20 tasks that would land on each node /
tasktracker).
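
In each node's hadoop-site.xml that could look like the following (a
minimal sketch using the illustrative numbers above -- adjust the
values for your own cluster):

  <property>
    <name>mapred.map.tasks</name>
    <value>100</value>
    <description>Default number of map tasks per job (5 nodes * 4
    cores * 5 tasks per core in the example above).</description>
  </property>

  <property>
    <name>mapred.tasktracker.tasks.maximum</name>
    <value>25</value>
    <description>Maximum number of tasks a single tasktracker will
    run, set above the ~20 tasks expected per node.</description>
  </property>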

Those settings will get the tasks running, but there are some other
things to consider.  First, mapred.map.tasks sets the default number of
tasks, meaning each job is broken into about that many tasks (usually
that or a little more).  You may not want some jobs broken up into that
many pieces, because it can take longer to split a job into say 100
pieces and process each piece than it would to split it into say 5
pieces and run it.  So consider whether the job is big enough to
warrant the overhead.  There are also settings such as
mapred.submit.replication, mapred.speculative.execution, and
mapred.reduce.parallel.copies which can be tuned to make the entire
process run faster.
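
Those also live in hadoop-site.xml.  As a rough sketch (the values
below are placeholders only -- the right numbers depend on your
cluster and job, so check the defaults shipped in hadoop-default.xml
before overriding them):

  <property>
    <name>mapred.submit.replication</name>
    <value>10</value>
  </property>

  <property>
    <name>mapred.speculative.execution</name>
    <value>true</value>
  </property>

  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>5</value>
  </property>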

Try this and see if it gives you the results you are looking for.  As
for running multiple tasktrackers per node: you can do that, but you
would have to modify the start-all.sh and stop-all.sh scripts to start
and stop the multiple trackers, and you would probably need a separate
install path and configuration (hadoop-site.xml file) for each
tasktracker, since there are pid files to be concerned with.
Personally I think that is a more difficult way to proceed.

Dennis

Gianlorenzo Thione wrote:
> Thanks for the answer. So far I am still trying to understand how each 
> tasktracker gets multiple map or reduce tasks to execute 
> simultaneously. I have run a simple job with 53 map tasks on 5 nodes, 
> and at all times each node was executing a single task. Each cluster 
> node is a 4-core machine, so theoretically this was equivalent to a 
> 16-node cluster, and I feel that the resources were actually 
> underutilized. Am I missing something? Is there a parameter for a 
> minimum number of tasks to be executed in parallel (I found a 
> parameter for setting a maximum [which I set to 4])? If I run 4 
> TaskTrackers per node then each TaskTracker gets a map task at the 
> same time and execution seems much faster overall.
>
> I'd appreciate help and insights on this matter. 
> Eventually each map task in our application will synchronize with an 
> external single-threaded, CPU-intensive process to process data (thus 
> using the tasktracker as a driver for these processes). We need to 
> make sure that each node is utilized at its maximum capacity at all 
> times by running 4 instances of those single-threaded processes, and 
> to achieve that we'd need each TaskTracker to be handed on average 4 
> map tasks at a time, each run concurrently in a different thread. Is 
> there a way to guarantee that this happens? Alternatively, we can 
> always run 4 TaskTrackers per node, which was our original plan, but 
> if there is a better/smarter way to do this, that would be the best 
> solution.
>
> Thanks in advance!
>
> Lorenzo Thione
>
> On May 24, 2006, at 7:31 AM, Dennis Kubes wrote:
>
>> Using Java 5 will allow the threads of the various tasks to take 
>> advantage of multiple processors.  Just make sure you set your map 
>> tasks property to a multiple of the total number of processors.  We 
>> are running multi-core machines and are seeing good utilization 
>> across all cores this way.
>>
>> Dennis
>>
>>
>>
>> Gianlorenzo Thione wrote:
>>> Hello everybody,
>>>
>>> I'll ask my first question on this forum and hopefully start 
>>> building more and more understanding of hadoop so that we can 
>>> eventually contribute actively. In the meantime, I have a simple 
>>> issue/question/suggestion....
>>>
>>> I have many multi-core, multi-processor nodes in my cluster and I'd 
>>> like to be able to run several tasktrackers and datanodes per 
>>> physical machine. I am modifying the startup scripts so that a 
>>> number of worker JVMs can be started on each node, capped at the 
>>> number of CPUs seen by the kernel.
>>>
>>> Since our map jobs are highly CPU-intensive, it makes sense to run 
>>> parallel jobs on each node, maximizing CPU utilization.
>>>
>>> Is that something that would make sense to roll back into the 
>>> scripts for hadoop as well? Is anybody else running on 
>>> multi-processor architectures?
>>>
>>> Lorenzo Thione
>>> Powerset, Inc.
>>>
>
