hadoop-common-user mailing list archives

From: Doug Cutting <cutt...@apache.org>
Subject: Re: Multiple tasktrackers per node
Date: Thu, 25 May 2006 18:34:11 GMT
Bryan A. Pendleton wrote:
> I would still like to see some of these site preferences be more
> dynamic. For instance, I will soon be using both single CPU and dual
> CPU machines, with varying amounts of RAM. I'd happily have an extra
> job or 2 scheduled on the dual CPU machines, to keep them utilized and
> take better advantage of the RAM (which is mostly serving as disk
> cache for my current loads). But, there's no way to set a different
> tasks.maximum for each node (or a concept of "class of node") at this
> point.

Sure there is: a separate config file per node.
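
For example, a dual-CPU machine's hadoop-site.xml can override the 
cluster-wide default shipped in hadoop-default.xml.  A minimal sketch 
(I'm assuming the property name mapred.tasktracker.tasks.maximum here; 
check your version's hadoop-default.xml for the exact key):

    <?xml version="1.0"?>
    <configuration>
      <!-- per-node override: deployed only to the dual-CPU machines -->
      <property>
        <name>mapred.tasktracker.tasks.maximum</name>
        <value>4</value>
      </property>
    </configuration>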

If you'd like to make this automatic, that would be great.  We'd need 
portable Java code to detect the amount of memory and number of CPUs. 
Perhaps this could be done by running some shell commands and parsing 
their output, relying on Cygwin for Windows support?
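
Something along these lines, as a rough sketch: availableProcessors() 
covers the CPU count portably, and memory could come from parsing 
/proc/meminfo via a shell command.  Only the Linux branch is written 
out below; Windows (via Cygwin) and other Unixes would need their own 
commands and parsing:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    // Sketch: detect per-node resources so a tasktracker could size
    // tasks.maximum automatically.  Linux-only as written.
    public class NodeResources {

      public static int numCpus() {
        return Runtime.getRuntime().availableProcessors();
      }

      // Parses the "MemTotal:  4043276 kB" line of /proc/meminfo and
      // returns bytes, or -1 if it can't be determined on this platform.
      public static long totalMemoryBytes() {
        try {
          Process p = Runtime.getRuntime().exec(
              new String[] { "cat", "/proc/meminfo" });
          BufferedReader in = new BufferedReader(
              new InputStreamReader(p.getInputStream()));
          String line;
          while ((line = in.readLine()) != null) {
            if (line.startsWith("MemTotal:")) {
              String[] parts = line.trim().split("\\s+");
              return Long.parseLong(parts[1]) * 1024L;  // kB -> bytes
            }
          }
        } catch (Exception e) {
          // fall through: unsupported platform or unexpected format
        }
        return -1;
      }

      public static void main(String[] args) {
        System.out.println("cpus=" + numCpus()
            + " memBytes=" + totalMemoryBytes());
      }
    }

A tasktracker could then pick, say, two tasks per CPU, capped by 
available RAM.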

Owen's recent benchmark posting showed that machines with a 5x 
performance variation were used effectively during the map phase, but 
that slow machines still hurt reduce performance.  He's submitted a bug 
and will likely fix it (if past experience is any guide).

> Adapting to variability of resource is still a big problem across
> hadoop. Performance still drops off very rapidly in many cases if you
> have a weak node - there's no speculative reduce execution, bugs in
> speculative map execution, bad handling of filled-up space during DFS
> writes, as well as MapOutputFile writes. In fact, anything that calls
> "getLocalPath" gets uniformly spread across available drives, with no
> "full" checking - filling up any one drive on the entire cluster can
> cause all kinds of things to fail.

Sounds like a good list of things to work on.  Want to take on solving 
any of these?  They won't fix themselves...
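
To make the last point concrete, here is a rough sketch of the kind of 
"full" checking getLocalPath could do: round-robin across the 
configured local directories, but skip any volume under a free-space 
floor instead of spreading writes blindly.  (The class and its names 
are hypothetical, and File.getUsableSpace() needs Java 6; nothing like 
this is in Hadoop today.)

    import java.io.File;

    // Hypothetical: pick paths across the mapred.local.dir volumes,
    // skipping any that is nearly full.
    public class LocalPathPicker {

      private static final long MIN_FREE_BYTES = 64L * 1024 * 1024;

      private final File[] dirs;
      private int next = 0;  // round-robin cursor

      public LocalPathPicker(String[] dirPaths) {
        dirs = new File[dirPaths.length];
        for (int i = 0; i < dirPaths.length; i++) {
          dirs[i] = new File(dirPaths[i]);
        }
      }

      // Returns a directory with room to spare, or null if every
      // volume is below the floor, so callers can fail the task early
      // instead of filling a disk and breaking unrelated writers.
      public synchronized File getLocalPath() {
        for (int tried = 0; tried < dirs.length; tried++) {
          File dir = dirs[next];
          next = (next + 1) % dirs.length;
          if (dir.getUsableSpace() > MIN_FREE_BYTES) {  // Java 6+
            return dir;
          }
        }
        return null;
      }
    }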

