hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doug Cutting <cutt...@apache.org>
Subject Re: [Lucene-hadoop Wiki] Update of "FAQ" by DevarajDas
Date Thu, 06 Sep 2007 18:08:25 GMT
Apache Wiki wrote:
> + Sort performances on 1400 nodes and 2000 nodes are pretty good too - sorting 14TB of
data on a 1400-node cluster takes 2.2 hours; sorting 20TB on a 2000-node cluster takes 2.5
hours. The updates to the above configuration being: 
> +   * `mapred.job.tracker.handler.count = 60`
> +   * `mapred.reduce.parallel.copies = 50`
> +   * `tasktracker.http.threads = 50`

This is a pretty good indication of stuff that we might better specify 
as proportional to cluster size.  For example, we might replace the 
first with something like mapred.jobtracker.tasks.per.handler=30.  To 
determine the number of handlers we'd determine the number of task slots 
(#nodes * mapred.tasktracker.tasks.maximum) and divide that by 
tasks.per.handler to determine the number of handlers.  Then folks 
wouldn't need to alter these settings as their cluster grows.

It's best if folks don't have to change defaults for good performance. 
Not only does that simplify configuration, but it means we can more 
easily change implementations.  For example, if we switch to async RPC 
responses, then the handler count may change significantly, and we'll 
probably change the default, and it would be nice if most folks were not 
overriding the default.

Thoughts?  Should we file an issue?


View raw message