Mailing-List: contact hadoop-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hadoop-user@lucene.apache.org
In-Reply-To: <44746E29.1030803@dragonflymc.com>
References: <5438AA87-1469-4F49-BABF-43E3A6BD1856@gmail.com>
 <D878EB96-A840-4CE0-B8B9-156778BBD6C3@mac.com>
 <44746E29.1030803@dragonflymc.com>
Mime-Version: 1.0 (Apple Message framework v750)
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
Message-Id: <563DFE2F-2EEE-4DB8-AE68-268806A7018E@gmail.com>
Cc: Barney Pell <barney@powerset.com>
Content-Transfer-Encoding: 7bit
From: Gianlorenzo Thione <thione@gmail.com>
Subject: Re: Multiple tasktrackers per node
Date: Wed, 24 May 2006 22:51:28 -0700
To: hadoop-user@lucene.apache.org

Thanks for the answer. So far I am still trying to understand how  
each tasktracker gets multiple map or reduce tasks to be executed  
simultaneously. I have run a simple job with 53 map tasks on 5 nodes,  
and at all times each node was executing a single task. Each cluster  
node is a 4 core machine, so theoretically this was a 16-node cluster  
and I feel that the resources were actually underutilized. Am I  
missing something? Is there a parameter for a minimum number of tasks  
to be executed in parallel (I found a parameter for setting a maximum  
[which I set to 4])? If I run 4 TaskTrackers per node then each node  
gets a map task at the same time and execution seems overall much  
faster.
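
To put rough numbers on the gap, here is a back-of-the-envelope
sketch in Java. It assumes that all 5 nodes run tasks, that every
task takes about the same time, and that a node with N concurrent
task slots really runs N tasks at once. The class and method names
are made up for illustration and are not part of Hadoop.

```java
final class SlotWaves {
    /**
     * Number of scheduling rounds ("waves") needed to finish all map
     * tasks if every available slot runs exactly one task per round.
     */
    static int waves(int mapTasks, int nodes, int tasksPerNode) {
        int slots = nodes * tasksPerNode;
        return (mapTasks + slots - 1) / slots; // ceiling division
    }

    public static void main(String[] args) {
        // The job described above: 53 map tasks on 5 nodes.
        System.out.println("1 task per node:  " + waves(53, 5, 1) + " waves");
        System.out.println("4 tasks per node: " + waves(53, 5, 4) + " waves");
    }
}
```

Under those assumptions the job needs 11 waves with one task per
node and 3 waves with four tasks per node, which is in line with the
speedup I see when running 4 TaskTrackers per node.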

I'd appreciate help and insights with respect to this matter.
Eventually each map task in our application will synchronize with an
external single-threaded CPU-intensive process to process data (thus
using the tasktracker as a driver for these processes). We need to
make sure that each node is utilized at its maximum capacity at all
times by running 4 instances of those single-threaded processes, and
in order to achieve that we'd need each TaskTracker to be handed on
average 4 map tasks at a time, each to be run concurrently in a
different thread. Is there a way to guarantee that this happens?
Alternatively, we can always run 4 TaskTrackers per node, which was
our original plan, but if there is a better/smarter way to do this,
that would be the best solution.
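
The behaviour we are after within a single JVM looks roughly like
the sketch below. This is plain Java and not Hadoop code: the class
name, the method name and the use of "java -version" as a stand-in
for our external process are all placeholders. It only illustrates
the idea of one driver keeping a fixed number of single-threaded
external processes busy at the same time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ExternalWorkerPool {
    /**
     * Runs each command as an external process, with at most `slots`
     * processes alive at once, and returns the exit codes in
     * submission order.
     */
    static List<Integer> runAll(List<List<String>> commands, int slots)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(slots);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (List<String> command : commands) {
                // Each pool thread starts one process and blocks until it exits.
                Callable<Integer> job =
                        () -> new ProcessBuilder(command).inheritIO().start().waitFor();
                pending.add(pool.submit(job));
            }
            List<Integer> exitCodes = new ArrayList<>();
            for (Future<Integer> result : pending) {
                exitCodes.add(result.get());
            }
            return exitCodes;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder command: the real job would launch our own process here.
        String javaBin = System.getProperty("java.home") + "/bin/java";
        List<List<String>> commands = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            commands.add(Arrays.asList(javaBin, "-version"));
        }
        // One slot per core, e.g. 4 on our 4-core nodes.
        int slots = Runtime.getRuntime().availableProcessors();
        System.out.println("Exit codes: " + runAll(commands, slots));
    }
}
```

With a fixed pool of 4 threads on a 4-core node, four external
processes would run side by side, which is the utilization we would
like a single TaskTracker to give us.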

Thanks in advance!

Lorenzo Thione

On May 24, 2006, at 7:31 AM, Dennis Kubes wrote:

> Using Java 5 will allow the threads of various tasks to take  
> advantage of multiple processors.  Just make sure you set your map
> tasks property to a multiple of the number of processors total.  We  
> are running multi-core machines and are seeing good utilization  
> across all cores this way.
>
> Dennis
>
>
>
> Gianlorenzo Thione wrote:
>> Hello everybody,
>>
>> I'll ask my first question on this forum and hopefully start  
>> building more and more understanding of hadoop so that we can  
>> eventually contribute actively. In the meanwhile, I have a simple  
>> issue/question/suggestion....
>>
>> I have many multi-core, multi-processor nodes in my cluster and  
>> I'd like to be able to run several tasktrackers and datanodes per
>> physical machine. I am modifying the startup scripts so that a  
>> number of worker JVMs can be started on each node, maxed out at  
>> the number of CPUs seen by the kernel.
>>
>> Since our map jobs are highly CPU intensive it makes sense to run  
>> parallel jobs on each node, maximizing the CPU utilization.
>>
>> Is that something that would make sense to roll back into the
>> scripts for hadoop as well? Anybody else running on multi  
>> processor architectures?
>>
>> Lorenzo Thione
>> Powerset, Inc.
>>