hadoop-common-user mailing list archives

From Dale McDiarmid <d...@ravn.co.uk>
Subject Re: Large server recommendations
Date Thu, 15 Dec 2011 21:58:26 GMT
Thanks Matt,
So, assuming I run a single tasktracker and have 48 cores 
available, and based on your recommendation of 2:1 mapper to reducer 
threads, I will be assigning:


This brings me on to my question:

"Can I confirm mapred.map.tasks and mapred.reduce.tasks: *are these 
JobTracker parameters*? The recommendation for these settings seems to 
be related to the number of task trackers. In my architecture, I 
potentially have only 1, if only a single task tracker can be configured 
on each host. What should I set these values to, therefore, considering 
the box spec?"

I have read:

mapred.map.tasks = 10x the number of task trackers
mapred.reduce.tasks = 2x the number of task trackers

Given I have a single task tracker with multiple concurrent processes, 
does this equate to:

mapred.map.tasks = 300?
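
The rule of thumb quoted above is plain multiplication, so it can be sketched in a few lines. This is only an illustration: it assumes the first property is meant to be mapred.map.tasks, `job_task_hints` is a hypothetical helper name, and whether the unit to count is the single task tracker or its concurrent task processes is exactly the open question. The value 30 below is illustrative, not a measured slot count.

```python
def job_task_hints(units, map_factor=10, reduce_factor=2):
    """Apply the 10x / 2x rule of thumb to a count of task trackers (or slots)."""
    return {
        "mapred.map.tasks": map_factor * units,
        "mapred.reduce.tasks": reduce_factor * units,
    }


# Counting the single task tracker as one unit.
print(job_task_hints(1))
# Counting 30 concurrent map processes as the units instead (illustrative).
print(job_task_hints(30))
```

The two calls differ by a factor of 30, which is why the choice of unit matters more than the multipliers themselves.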

Some reasoning behind these values would be appreciated...

I appreciate this is a little simplified and that we will need to profile. 
I am just looking for a sensible starting position.
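
One possible starting position for the per-tasktracker limits can be sketched as below. This is a minimal sketch under stated assumptions, not a tuned configuration: it assumes one core is held back for each of the four daemons, that every remaining core hosts one child task, and that the remainder is divided roughly 2:1 between map and reduce slots. The function name `split_slots` is hypothetical; the two property names are the ones discussed in this thread.

```python
def split_slots(cores, daemons=4, map_to_reduce=2):
    """Divide the cores left after the daemons into map and reduce slots.

    Assumption: one core is reserved per daemon (Namenode, Jobtracker,
    Datanode, Tasktracker) and each remaining core hosts one child task.
    """
    task_cores = cores - daemons
    if task_cores < 2:
        raise ValueError("not enough cores left for map and reduce slots")
    # One share goes to reducers, map_to_reduce shares go to mappers.
    reduce_slots = max(1, task_cores // (map_to_reduce + 1))
    map_slots = task_cores - reduce_slots
    return map_slots, reduce_slots


map_slots, reduce_slots = split_slots(48)
settings = {
    "mapred.tasktracker.map.tasks.maximum": map_slots,
    "mapred.tasktracker.reduce.tasks.maximum": reduce_slots,
}
for name, value in settings.items():
    print(f"{name} = {value}")
```

For 48 cores this gives 30 map slots and 14 reduce slots. Whether the 48 cores are logical or physical changes the headroom, and job profiling should override these numbers.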

On 15/12/2011 16:43, GOEKE, MATTHEW (AG/1000) wrote:
> Dale,
> Talking solely about Hadoop core, you will only need to run 4 daemons on that machine:
> Namenode, Jobtracker, Datanode and Tasktracker. There is no reason to run multiple of any
> of them, as the tasktracker will spawn multiple child JVMs, which is where you will get your
> task parallelism. When you set your mapred.tasktracker.map.tasks.maximum and
> mapred.tasktracker.reduce.tasks.maximum configurations you will limit the upper bound of
> the child JVM creation, but this needs to be configured based on job profile (I don't know
> much about Mahout, but traditionally I set up the clusters as 2:1 mappers to reducers until
> the profile proves otherwise). If you look at blogs / archives you will see that you can
> assign 1 child task per *logical* core (e.g. hyper-threaded core) and, to be safe, you will
> want 1 daemon per *physical* core, so you can divvy it up based on that recommendation.
> To summarize the above: if you are sharing the same IO pipe / box then there is no reason
> to have multiple daemons running, because you are not really gaining anything from that level
> of granularity. Others might disagree based on virtualization, but in your case I would say
> save yourself the headache and keep it simple.
> Matt
> -----Original Message-----
> From: Dale McDiarmid [mailto:dale@ravn.co.uk]
> Sent: Thursday, December 15, 2011 1:50 PM
> To: common-user@hadoop.apache.org
> Subject: Large server recommendations
> Hi all
> I am new to the community and to using Hadoop, and am looking for some advice
> on optimal configurations for very large servers.  I have a single server
> with 48 cores and 512GB of RAM and am looking to perform an LDA analysis
> using Mahout across approx 180 million documents.  I have configured my
> namenode and job tracker.  My questions are primarily around the optimal
> number of tasktrackers and data nodes.  I have had no issues configuring
> multiple datanodes, each of which could potentially use its own
> disk location (underlying disk is SAN - solid state).
> However, from my reading the typical architecture for hadoop is a larger
> number of smaller nodes with a single tasktracker on each host.  Could
> someone please clarify the following:
> 1. Can multiple task trackers be run on a single host? If so, how is
> this configured as it doesn't seem possible to control the host:port.
> 2. Can i confirm mapred.map.tasks and mapred.reduce.tasks are JobTracker
> parameters? The recommendation for these settings seems to be related to
> the number of task trackers.  In my architecture, i have potentially
> only 1 if a single task tracker can only be configured on each host.
> What should i set these values to therefore considering the box spec?
> 3. I noticed the parameters mapred.tasktracker.map.tasks.maximum and
> mapred.tasktracker.reduce.tasks.maximum - do these control the number of
> JVM processes spawned to handle the respective steps? Is one tasktracker
> configured with 48 equivalent to 48 task trackers each configured with a
> value of 1 for these settings?
> 4. Benefits of a large number of datanodes on a single large server? I
> can see value where the host has multiple IO interfaces and disk sets to
> avoid IO contention. In my case, however, a SAN negates this.  Are there
> still benefits of multiple datanodes outside of resiliency and potential
> increase of data transfer i.e. assuming a single data node is limited
> and single threaded?
> 5. Any other thoughts/recommended settings?
> Thanks
> Dale
