hadoop-common-dev mailing list archives

From Owen O'Malley <o...@yahoo-inc.com>
Subject Re: Running tasks in the TaskTracker VM
Date Tue, 20 Mar 2007 04:04:09 GMT

On Mar 19, 2007, at 10:51 AM, Philippe Gassmann wrote:

> Doug Cutting wrote:
>> A simpler approach might be to develop an InputFormat that includes
>> multiple files per split.
> Yes, but the issue remains if you have to deal with a high number of
> map tasks to distribute the load across many machines. Launching a
> JVM is costly; say it costs 1 second (I'm optimistic). If you have
> to run 2000 maps, 2000 seconds are lost just launching JVMs...

For task granularity, the most that makes sense is roughly 10-50
tasks per node. Given that a node runs at least 2 tasks at once, at
one second per JVM launch that works out to 5-25 seconds of wallclock
time per node. It is noticeable, but shouldn't be the dominant factor.
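For what it's worth, Doug's multiple-files-per-split idea mostly
comes down to a packing problem. Here is a minimal sketch;
MultiFileSplit and the greedy policy are stand-ins of mine, not real
Hadoop classes or the actual InputFormat hooks:

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Map;

  /** Sketch: pack many small files into a fixed number of splits. */
  public class MultiFileSplitter {

    /** A split covering several small files instead of one. */
    public static class MultiFileSplit {
      final List<String> paths = new ArrayList<String>();
      long totalBytes = 0;
    }

    /** Greedily add each file to the currently smallest split, so
     *  every map ends up with roughly total/numSplits bytes. */
    public static List<MultiFileSplit> pack(Map<String, Long> fileSizes,
                                            int numSplits) {
      List<MultiFileSplit> splits = new ArrayList<MultiFileSplit>();
      for (int i = 0; i < numSplits; i++) {
        splits.add(new MultiFileSplit());
      }
      for (Map.Entry<String, Long> e : fileSizes.entrySet()) {
        MultiFileSplit smallest = splits.get(0);
        for (MultiFileSplit s : splits) {
          if (s.totalBytes < smallest.totalBytes) {
            smallest = s;
          }
        }
        smallest.paths.add(e.getKey());
        smallest.totalBytes += e.getValue().longValue();
      }
      return splits;
    }
  }

Packing into the smallest bin first keeps the splits balanced, so
you still get even load across the cluster without paying a JVM
launch per file.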

>>> I already have a working patch against the 0.10.1 release of
>>> Hadoop that launches tasks inside the TaskTracker JVM if a
>>> specific parameter is set in the JobConf of the launched job (for
>>> jobs we trust ;) ).

Another possible direction would be to have the Task JVM ask for  
another Task before exiting. I believe that Ben Reed experimented  
with that and the changes were not too extensive. For security, you  
would want to limit the JVM reuse to tasks within the same job.
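Roughly, the child side of that would look like the loop below. The
Umbilical interface and getNextTask are hypothetical stand-ins for
the task-to-tracker RPC, not the real protocol; this just shows the
shape of the idea:

  /** Sketch of the ask-for-another-task idea. */
  public class TaskChildLoop {

    interface Umbilical {
      /** Returns the next task of the given job, or null when done. */
      Runnable getNextTask(String jobId);
    }

    static void runTasks(Umbilical umbilical, String jobId) {
      Runnable task;
      // Only reuse this JVM for tasks of the same job, so one user's
      // code never runs in a JVM another user's code may have polluted.
      while ((task = umbilical.getNextTask(jobId)) != null) {
        task.run();
      }
      // Nothing left for this job: exit and let the tracker reclaim us.
    }
  }

Keying the request on the job id is what gives you the security
property above: the JVM dies as soon as its own job has no more
tasks, so it is never handed to another user's code.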

As a side note, we've already seen cases of client code that killed
the task trackers, so it is hardly an abstract concern. *smile* (The
client code managed to send kill signals to the entire process group,
which included the task tracker. It was hard to debug, and I'm not
very interested in making it easier for client code to take out the
task tracker.)
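One mitigation we could consider, sketched below, is to start each
child in its own session with setsid(1), so a group-wide kill from
user code stops at the task. This wrapper is an assumption about a
possible fix, not what the TaskTracker currently does, and it is
Unix-only:

  import java.io.IOException;

  /** Sketch: launch the task child under setsid so a group-wide
   *  kill from user code cannot reach the TaskTracker. */
  public class IsolatedChildLauncher {

    public static Process launch(String[] taskCmd) throws IOException {
      String[] wrapped = new String[taskCmd.length + 1];
      wrapped[0] = "setsid";  // new session, hence new process group
      System.arraycopy(taskCmd, 0, wrapped, 1, taskCmd.length);
      // A kill sent to the child's process group now stops at the child.
      return Runtime.getRuntime().exec(wrapped);
    }
  }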

-- Owen