hadoop-common-dev mailing list archives

From Philippe Gassmann <philippe.gassm...@anyware-tech.com>
Subject Re: Running tasks in the TaskTracker VM
Date Mon, 19 Mar 2007 17:51:53 GMT
Doug Cutting wrote:
> Philippe Gassmann wrote:
>> At the moment, for each task (map or reduce) a new JVM is created by the
>> TaskTracker to run the Job.
>>
>> We have in our Hadoop cluster a high number of small files thus
>> requiring a high number of map tasks. I know this is suboptimal, but
>> aggregating those small files is not possible now. So an idea came to
>> us: launching jobs in the TaskTracker JVM, so that the overhead of
>> creating a new JVM disappears.
>
> A simpler approach might be to develop an InputFormat that includes
> multiple files per split.
>
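To make the suggestion above concrete, here is a minimal sketch of the grouping step only, assuming a greedy size-based packing. MultiFileSplitPacker, pack and targetBytes are hypothetical names for illustration, not Hadoop classes or parameters, and a real InputFormat would also need a RecordReader that walks the files of each split.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: groups small files into splits.
class MultiFileSplitPacker {
    // Greedy packing: keep adding files to the current split until the next
    // file would push it past targetBytes, then start a new split.
    // A single file larger than targetBytes still gets its own split.
    static List<List<String>> pack(List<String> names, long[] sizes, long targetBytes) {
        List<List<String>> splits = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;
        for (int i = 0; i < names.size(); i++) {
            if (!current.isEmpty() && currentBytes + sizes[i] > targetBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(names.get(i));
            currentBytes += sizes[i];
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }
}
```

With four 40-byte files and a 100-byte target, this yields two splits of two files each, so two map tasks instead of four.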

Yes, but the issue remains if you need a large number of map tasks to
distribute the load across many machines. Launching a JVM is costly:
say it costs 1 second (I'm being optimistic). If you have to run 2000
map tasks, 2000 seconds are lost launching JVMs...
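
The arithmetic can be written down as a small sketch. The 1 second per launch and the 2000 tasks are the figures above; the number of concurrent task slots is a hypothetical parameter added here only to show how the cumulative cost translates into time per slot. JvmLaunchOverhead and its methods are illustrative names.

```java
// Illustrative only: estimates time spent starting JVMs, one fresh JVM per task.
class JvmLaunchOverhead {
    // Cumulative launch cost over the whole job, summed across all machines.
    static long totalLaunchSeconds(long tasks, long secondsPerLaunch) {
        return tasks * secondsPerLaunch;
    }

    // Launch cost seen by the busiest slot when tasks are spread evenly
    // over `slots` concurrent task slots (hypothetical parameter).
    static long perSlotLaunchSeconds(long tasks, long secondsPerLaunch, long slots) {
        long tasksPerSlot = (tasks + slots - 1) / slots; // ceiling division
        return tasksPerSlot * secondsPerLaunch;
    }
}
```

For 2000 tasks at 1 second each, the cumulative cost is 2000 seconds; with, say, 20 slots running in parallel, each slot still spends 100 seconds doing nothing but starting JVMs.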

>> I already have a working patch against the 0.10.1 release of Hadoop that
>> launches tasks inside the TaskTracker JVM if a specific parameter is set
>> in the JobConf of the launched Job (for jobs we trust ;) ).
>
> Ideally this could be through a task-running interface, that permits
> one to plug in different implementations.  For example, sometimes it
> may make sense to run tasks in-process, sometimes to run them in a
> child JVM, and sometimes to fork a non-Java sub-process.  So, rather
> than specifying a flag on the job, one would specify the runner
> implementation class.
>

A bit of refactoring of the TaskRunner hierarchy is needed for this to
work: the code that launches tasks in the tracker JVM or in a separate
process is very similar, so it would make sense for TaskRunner to be the
superclass of an InJVMRunner and a ChildJVMRunner.
But what do we do with MapTaskRunner and ReduceTaskRunner? It is not
acceptable to have, say, two or more implementations of MapTaskRunner
(one for execution in a child JVM, one for execution in the tracker
JVM...). That would be painful to maintain and very complicated.
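
One possible way around the duplication, sketched below with hypothetical names (none of these types exist in Hadoop, and this is not how the current TaskRunner classes are written), is to keep the map/reduce distinction in the runner hierarchy and move the "where does it run" decision into a separate launcher object that the runner delegates to. The launchers here only return a string so the sketch stays runnable; a real one would either call the task code directly or fork a process.

```java
// Hypothetical names throughout; a sketch, not the Hadoop TaskRunner API.

// Strategy for where a task executes.
interface TaskLauncher {
    String launch(String taskDescription);
}

// Stand-in for running the task inside the tracker JVM.
class InJvmLauncher implements TaskLauncher {
    public String launch(String taskDescription) {
        return "in-tracker-jvm:" + taskDescription;
    }
}

// Stand-in for forking a child JVM to run the task.
class ChildJvmLauncher implements TaskLauncher {
    public String launch(String taskDescription) {
        return "child-jvm:" + taskDescription;
    }
}

// Common base: knows what to run, delegates how to run it.
abstract class SketchTaskRunner {
    private final TaskLauncher launcher;

    SketchTaskRunner(TaskLauncher launcher) {
        this.launcher = launcher;
    }

    // Map and reduce runners differ only in how they describe their task.
    abstract String describe(String taskId);

    String run(String taskId) {
        return launcher.launch(describe(taskId));
    }
}

class SketchMapRunner extends SketchTaskRunner {
    SketchMapRunner(TaskLauncher launcher) { super(launcher); }
    String describe(String taskId) { return "map " + taskId; }
}

class SketchReduceRunner extends SketchTaskRunner {
    SketchReduceRunner(TaskLauncher launcher) { super(launcher); }
    String describe(String taskId) { return "reduce " + taskId; }
}
```

With this split there is one map runner and one reduce runner, and each launch mode is written once. The launcher class to use could then be named in the job configuration, in the spirit of specifying a runner implementation class rather than a flag.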

> Doug

