hadoop-common-user mailing list archives

From "Bryan A. Pendleton" <...@geekdom.net>
Subject Re: Multiple tasktrackers per node
Date Thu, 25 May 2006 18:18:28 GMT
I would still like to see some of these site preferences be more
dynamic. For instance, I will soon be using both single-CPU and
dual-CPU machines, with varying amounts of RAM. I'd happily have an
extra job or two scheduled on the dual-CPU machines, to keep them
utilized and take better advantage of the RAM (which mostly serves as
disk cache for my current loads). But there's no way to set a
different tasks.maximum for each node (or a concept of "class of
node") at this point. If I set the value too high, tasks are more
likely to fail on the lower-class nodes; too low, and I won't use the
whole cluster effectively.

Adapting to variability of resources is still a big problem across
Hadoop. Performance still drops off very rapidly in many cases if you
have a weak node: there's no speculative reduce execution, there are
bugs in speculative map execution, and there's bad handling of
filled-up disk space during DFS writes as well as MapOutputFile
writes. In fact, anything that calls "getLocalPath" gets spread
uniformly across the available drives with no "full" checking, so
filling up any one drive anywhere in the cluster can cause all kinds
of things to fail.

On 5/25/06, Dennis Kubes <nutch-dev@dragonflymc.com> wrote:
> There are a few parameters that would need to be set.
> mapred.tasktracker.tasks.maximum specifies the maximum number of tasks
> per task tracker.  mapred.map.tasks sets the default number of map tasks
> per job.  Usually this is set to a multiple of the number of processors
> you have.  So if you have 5 nodes each with 4 cores, you could set
> mapred.map.tasks to something like 100 (5 nodes * 4 cores = 20 cores,
> and 20 cores * 5 tasks per core = 100), so that 5 tasks run on each
> processor simultaneously.  mapred.tasktracker.tasks.maximum would then
> be set to, say, 25 (more than the 20 concurrent tasks per node /
> tasktracker).
> Those settings configure how tasks run, but there are some other
> things to consider.  First, mapred.map.tasks sets the default number
> of tasks, meaning each job is broken into about that many tasks
> (usually that or a little more).  You may not want every job broken
> into that many pieces, because it can take longer to split a job into,
> say, 100 pieces and process each piece than it would to split it into
> 5 pieces and run those.  So consider whether the job is big enough to
> warrant the overhead.  There are also settings such as
> mapred.submit.replication, mapred.speculative.execution, and
> mapred.reduce.parallel.copies which can be tuned to make the entire
> process run faster.
> Try this and see if it gives you the results you are looking for.  As
> for running multiple tasktrackers per node: you can do that, but you
> would have to modify the start-all.sh and stop-all.sh scripts to start
> and stop the multiple trackers, and you would probably need different
> install paths and configurations (hadoop-site.xml files) for each
> tasktracker, since there are pid files to be concerned with.
> Personally I think that is the more difficult way to proceed.
> Dennis
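For reference, Dennis's suggested settings for the 5-node, 4-core example could be written as a hadoop-site.xml fragment like the sketch below. The values just restate his example and are starting points to tune, not recommendations:

```xml
<!-- hadoop-site.xml fragment (sketch of the example above) -->
<property>
  <name>mapred.map.tasks</name>
  <value>100</value> <!-- ~5 tasks per core across 20 cores -->
</property>
<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>25</value> <!-- above the ~20 concurrent tasks expected per node -->
</property>
<property>
  <name>mapred.speculative.execution</name>
  <value>true</value> <!-- one of the knobs Dennis mentions tuning -->
</property>
```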
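And for the multiple-tasktrackers-per-node route Dennis describes, the script change might look roughly like this sketch. The conf and pid directory names are made-up assumptions for illustration, and the sketch only echoes the commands it would run rather than executing them:

```shell
#!/bin/sh
# Sketch: start two tasktrackers on one node, each with its own config
# directory (hadoop-site.xml) and pid directory so pid files don't collide.
# Directory names here are illustrative assumptions, not real paths.
HADOOP_HOME=${HADOOP_HOME:-/opt/hadoop}
CMDS=""
for n in 1 2; do
  CONF_DIR="$HADOOP_HOME/conf-tracker$n"   # separate configuration per tracker
  PID_DIR="/var/hadoop/pids-tracker$n"     # separate pid files per tracker
  CMD="HADOOP_PID_DIR=$PID_DIR $HADOOP_HOME/bin/hadoop-daemon.sh --config $CONF_DIR start tasktracker"
  CMDS="$CMDS$CMD
"
  echo "$CMD"                              # echo only; a real script would run it
done
```

A matching stop script would loop the same way with `stop tasktracker`.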

Bryan A. Pendleton
Ph: (877) geek-1-bp