hadoop-common-user mailing list archives

From Dale McDiarmid <d...@ravn.co.uk>
Subject Large server recommendations
Date Thu, 15 Dec 2011 19:50:04 GMT
Hi all,
I am new to the community and to Hadoop, and I am looking for advice on 
optimal configurations for very large servers.  I have a single server 
with 48 cores and 512GB of RAM and want to run an LDA analysis using 
Mahout across approximately 180 million documents.  I have configured my 
namenode and job tracker.  My questions are primarily about the optimal 
number of tasktrackers and datanodes.  I have had no issues configuring 
multiple datanodes, each of which could potentially use its own disk 
location (the underlying disk is a solid-state SAN).

However, from my reading, the typical architecture for Hadoop is a larger 
number of smaller nodes, with a single tasktracker on each host.  Could 
someone please clarify the following:

1. Can multiple task trackers be run on a single host? If so, how is 
this configured, given that it doesn't seem possible to control the host:port?

2. Can I confirm that mapred.map.tasks and mapred.reduce.tasks are 
JobTracker parameters? The recommendations for these settings seem to 
relate to the number of task trackers.  In my architecture, I potentially 
have only one, if only a single task tracker can be configured on each 
host.  What should I therefore set these values to, given the box spec?

3. I noticed the parameters mapred.tasktracker.map.tasks.maximum and 
mapred.tasktracker.reduce.tasks.maximum - do these control the number of 
JVM processes spawned to handle the respective steps? Is one tasktracker 
with these values set to 48 equivalent to 48 task trackers with each 
value set to 1?

4. Are there benefits to running a large number of datanodes on a single 
large server? I can see value where the host has multiple IO interfaces 
and disk sets, to avoid IO contention. In my case, however, a SAN negates 
this.  Are there still benefits to multiple datanodes beyond resiliency 
and a potential increase in data transfer, i.e. assuming a single 
datanode is limited and single threaded?

5. Any other thoughts/recommended settings?
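
To make questions 2 and 3 concrete, the arithmetic I have in mind for a single tasktracker on this box can be sketched roughly as below. This is only an illustration, not a recommendation: the reserved cores, reserved RAM and the 2:1 map-to-reduce split are assumptions of mine, and the function name slot_budget is hypothetical. Only the 48 cores, the 512GB and the property names come from the setup described above (mapred.child.java.opts is the standard per-task JVM options property).

```python
def slot_budget(cores, ram_gb, reserved_cores=4, reserved_ram_gb=32):
    """Rough slot and heap estimate for one tasktracker on one large host.

    reserved_cores / reserved_ram_gb are assumed headroom for the OS and the
    Hadoop daemons; the 2:1 map:reduce split is likewise only an assumption.
    """
    usable_cores = cores - reserved_cores
    map_slots = usable_cores * 2 // 3          # assumed 2:1 map:reduce ratio
    reduce_slots = usable_cores - map_slots
    # Split the remaining RAM evenly across all task JVMs (one per slot).
    heap_mb = (ram_gb - reserved_ram_gb) * 1024 // usable_cores
    return {
        "mapred.tasktracker.map.tasks.maximum": map_slots,
        "mapred.tasktracker.reduce.tasks.maximum": reduce_slots,
        "mapred.child.java.opts": "-Xmx%dm" % heap_mb,
    }


if __name__ == "__main__":
    # Box spec from this message: 48 cores, 512GB of RAM.
    for name, value in slot_budget(48, 512).items():
        print("%s = %s" % (name, value))
```

Whether values derived this way are sensible for one very large host, rather than many small ones, is exactly what I am unsure about.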

