hadoop-mapreduce-user mailing list archives

From "GOEKE, MATTHEW (AG/1000)" <matthew.go...@monsanto.com>
Subject RE: controlling no. of mapper tasks
Date Mon, 20 Jun 2011 19:48:40 GMT
Praveen,

David is correct, but we might need to use different terminology. Hadoop looks at the number
of input splits: if a file is not splittable, then yes, it will only use 1 mapper for it.
Most files are splittable, and Hadoop will break them into multiple map tasks and work over
each split. What you need to look at is the number of concurrent mappers/reducers you have
defined per node, so that you do not cause context switching from running too many processes
per core. Take a look in mapred-site.xml and you should see the limits defined (if not,
check mapred-default.xml for your version).
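
For reference, this is roughly what those per-node slot limits look like in mapred-site.xml
on a Hadoop 0.20/1.x-era cluster (the values shown are the shipped defaults, not tuning
advice):

    <configuration>
      <property>
        <!-- maximum map task slots this TaskTracker runs concurrently -->
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>2</value>
      </property>
      <property>
        <!-- maximum reduce task slots this TaskTracker runs concurrently -->
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>2</value>
      </property>
    </configuration>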

Matt

-----Original Message-----
From: praveen.peddi@nokia.com [mailto:praveen.peddi@nokia.com] 
Sent: Monday, June 20, 2011 2:44 PM
To: mapreduce-user@hadoop.apache.org
Subject: RE: controlling no. of mapper tasks

Hi David,
I think Hadoop is looking at the data size, not the number of input files. If I pass in .gz
files, then yes, Hadoop chooses 1 map task per file, but if I pass in one huge text file, or
the same file split into 10 files, it chooses the same number of map tasks (191 in my case).
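
That matches how splittability works: gzip is not a splittable codec, so each .gz file
becomes exactly one split, while plain text is carved into roughly one split per HDFS block.
As an illustration only (hypothetical class name, old org.apache.hadoop.mapred API), this is
the hook an input format uses to make that decision; returning false reproduces the
one-mapper-per-file behaviour for any file:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Hypothetical sketch: force one map task per input file by declaring
    // every file unsplittable, just as the gzip codec check does for .gz.
    public class WholeFileTextInputFormat extends TextInputFormat {
      @Override
      protected boolean isSplitable(FileSystem fs, Path file) {
        return false; // never split: one map task per file
      }
    }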

Thanks
Praveen

-----Original Message-----
From: ext David Rosenstrauch [mailto:darose@darose.net] 
Sent: Monday, June 20, 2011 3:39 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: controlling no. of mapper tasks

On 06/20/2011 03:24 PM, praveen.peddi@nokia.com wrote:
> Hi there, I know the client can set "mapred.reduce.tasks" to specify the no.
> of reduce tasks and Hadoop honours it, but "mapred.map.tasks" is not
> honoured by Hadoop. Is there any way to control the number of map tasks?
> What I noticed is that Hadoop is choosing too many mappers, and there
> is extra overhead being added because of this. For example, when I have
> only 10 map tasks, my job finishes faster than when Hadoop chooses 191
> map tasks. I have a 5-slave cluster and 10 tasks can run in parallel. I
> want to set both map and reduce tasks to 10 for maximum efficiency.
>
> Thanks Praveen

The number of map tasks is determined dynamically based on the number of input chunks you
have. If you want fewer map tasks, either pass fewer input files to your job, or store the
files using larger chunk sizes (which will result in fewer chunks per file, and thus fewer
chunks in total).
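
On the job side, a hedged sketch of the usual knobs (old JobConf API, Hadoop 0.20/1.x; the
256 MB figure is an arbitrary example): raising the minimum split size merges block-sized
chunks into fewer, larger splits, while the map task count itself is only a hint.

    import org.apache.hadoop.mapred.JobConf;

    public class SplitTuning {
      public static void configure(JobConf conf) {
        // Ask for splits of at least 256 MB: fewer, larger splits
        // means fewer map tasks over the same input data.
        conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);

        // A hint only; the InputFormat may compute a different count.
        conf.setNumMapTasks(10);

        // The reduce task count, by contrast, is honoured exactly.
        conf.setNumReduceTasks(10);
      }
    }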

HTH,

DR