hadoop-mapreduce-user mailing list archives

From Allen Wittenauer <awittena...@linkedin.com>
Subject Re: Question about setting the number of mappers.
Date Tue, 19 Jan 2010 21:47:16 GMT

This means that -every node- can have 10 map tasks and 10 reduce tasks
running simultaneously.  If you want to limit how many tasks run -per node-
in order to change the performance profile, then you need to lower this.

10/10 is generally way way way too high.  Rule of thumb says you set both of
these based on # of cores.  So an 8 core box would do 4/4.  Start with that
number then start profiling the system usage and go up/down based upon what
you see.
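
For instance, a minimal mapred-site.xml sketch of that starting point for an 8-core box (the 4/4 values are only the rule-of-thumb default to start profiling from, not a tuned recommendation, and the TaskTracker has to be restarted to pick them up):

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>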

On 1/19/10 12:59 PM, "Teryl Taylor" <teryl.taylor@gmail.com> wrote:

> Hi Allen,
> 
> I have set them both to 10 in the mapred-site.xml file.
> 
> Cheers,
> 
> Teryl
> 
> 
> On Tue, Jan 19, 2010 at 4:32 PM, Allen Wittenauer
> <awittenauer@linkedin.com> wrote:
> 
>> What is the value of:
>> 
>> mapred.tasktracker.map.tasks.maximum
>> mapred.tasktracker.reduce.tasks.maximum
>> 
>> 
>> On 1/19/10 10:23 AM, "Teryl Taylor" <teryl.taylor@gmail.com> wrote:
>> 
>>> Hi guys,
>>> 
>>> Thanks for the answers.  Michael, yes you are right, that is, I guess, what
>>> I'm looking for...how to reduce the number of mappers running
>>> simultaneously.  The system is running really slow and I think it might be
>>> due to constant thread context switching because of so many Mappers running
>>> concurrently.  Is there a way to tell how many Mappers are running at the
>>> same time?  My concern is that even though I set mapred.map.tasks to 10,
>>> the job configuration file (i.e. the job.xml file that is generated to the
>>> logs directory) always says mapred.map.tasks is 1083, which makes me
>>> believe it is completely ignoring my setting.  This is confirmed by the
>>> snippet of code from the JobClient.java file in which the client seems to
>>> ask for the number of input splits and automatically sets mapred.map.tasks
>>> to that number...totally ignoring my setting.
>>> 
>>> Cheers,
>>> 
>>> Teryl
>>> 
>>> 
>>> 
>>> On Tue, Jan 19, 2010 at 12:14 PM, Clements, Michael <
>>> Michael.Clements@disney.com> wrote:
>>> 
>>>> Do you want to change the total # of mappers, or the # that run at any
>>>> given time? For example, Hadoop may split your job into 1083 mapper tasks,
>>>> only 10 of which it allows to run at a time.
>>>> 
>>>> The setting in mapred-site.xml caps how many mappers can run simultaneously
>>>> per machine. It does not affect how many total mappers will make up the job.
>>>> 
>>>> Total # of tasks in a job, and the # that can run simultaneously, are two
>>>> separate settings. Both are important for tuning performance. The
>>>> InputFormat controls the first.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> *From:* mapreduce-user-return-292-Michael.Clements=disney.com@hadoop.apache.org
>>>> [mailto:mapreduce-user-return-292-Michael.Clements=disney.com@hadoop.apache.org]
>>>> *On Behalf Of *Jeff Zhang
>>>> *Sent:* Monday, January 18, 2010 4:54 PM
>>>> *To:* mapreduce-user@hadoop.apache.org
>>>> *Subject:* Re: Question about setting the number of mappers.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Hi Teryl,
>>>> 
>>>> The number of mappers is determined by the InputFormat you use. In your
>>>> case, one way is to merge the files into larger files beforehand; another
>>>> is to use CombineFileInputFormat as your InputFormat, as sketched below.
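
A minimal sketch of that second option, assuming the new-API CombineFileInputFormat from org.apache.hadoop.mapreduce.lib.input (which may not be present in every 0.20 release; the old-API class lives in org.apache.hadoop.mapred.lib). CombinedBinaryInputFormat is a hypothetical name, not code from this thread:

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;

// Packs many small, unsplittable files into a bounded number of combined
// splits, so the job gets far fewer than one mapper per input file.
public class CombinedBinaryInputFormat extends CombineFileInputFormat<Text, BytesWritable> {

  public CombinedBinaryInputFormat() {
    // Cap each combined split at roughly 256 MB; fewer splits means fewer map tasks.
    setMaxSplitSize(256L * 1024 * 1024);
  }

  @Override
  public RecordReader<Text, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) throws IOException {
    // A real implementation would return a CombineFileRecordReader wrapping a
    // per-file reader for the compressed binary chunks; omitted in this sketch.
    throw new UnsupportedOperationException("record reader omitted in this sketch");
  }
}

The driver would then call job.setInputFormatClass(CombinedBinaryInputFormat.class) in place of the custom format.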
>>>> 
>>>> 
>>>> 
>>>>  On Mon, Jan 18, 2010 at 1:05 PM, Teryl Taylor <teryl.taylor@gmail.com>
>>>> wrote:
>>>> 
>>>> Hi everyone,
>>>> 
>>>> I'm playing around with the Hadoop map reduce library and I'm getting some
>>>> odd behaviour.  The system is set up on one machine using the
>>>> pseudo-distributed configuration.  I use KFS as my DFS.  I have written a
>>>> MapReduce program to process a bunch of binary files.  The files are
>>>> compressed in chunks, so I do not split the files.  There are 1083 files
>>>> that I am loading into the map reducer.
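
With a format that never splits its input, each file becomes exactly one split, and the framework runs one map task per split; that is what pins a job like this at 1083 mappers no matter what mapred.map.tasks says. A minimal sketch of that behaviour, with hypothetical names (not the CustomInputFormat used in the driver later in this message):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Stand-in for an unsplittable binary format: isSplitable() returns false,
// so every input file yields exactly one split and therefore one map task.
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // never split a file, e.g. 1083 files -> 1083 map tasks
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    // The reader for the compressed chunk format is omitted in this sketch.
    throw new UnsupportedOperationException("record reader omitted in this sketch");
  }
}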
>>>> 
>>>> Every time I run the map reducer:
>>>> 
>>>> ~/hadoop/bin/hadoop jar  /home/hadoop/lib/test.jar
>>>> org.apache.hadoop.mapreduce.apps.Test /input/2007/*/*/  /output
>>>> 
>>>> It always creates 1083 mapper tasks to do the processing...which is
>>>> extremely slow...so I wanted to try and lower the number to 10 and see how
>>>> the performance is.  I set the following in mapred-site.xml:
>>>> 
>>>> <configuration>
>>>>  <property>
>>>>    <name>mapred.job.tracker</name>
>>>>    <value>localhost:9001</value>
>>>>  </property>
>>>>  <property>
>>>>   <name>mapred.tasktracker.map.tasks.maximum</name>
>>>>   <value>10</value>
>>>>  </property>
>>>>  <property>
>>>>   <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>>>   <value>10</value>
>>>>  </property>
>>>>  <property>
>>>>    <name>mapred.map.tasks</name>
>>>>    <value>10</value>
>>>>  </property>
>>>> </configuration>
>>>> 
>>>> I have recycled the jobtracker and tasktracker and I still always get 1083
>>>> mappers.   The map reducer is working as expected other than this.  I'm
>>>> using the new API and my main function in my class looks like:
>>>> 
>>>>  public static void main(String[] args) throws Exception {
>>>>     Configuration conf = new Configuration();
>>>>     String[] otherArgs = new GenericOptionsParser(conf,
>>>> args).getRemainingArgs();
>>>>     if (otherArgs.length != 2) {
>>>>       System.err.println("Usage: wordcount <in> <out>");
>>>>       System.exit(2);
>>>>     }
>>>> //    conf.setInt("mapred.map.tasks", 10);
>>>>     Job job = new Job(conf, "word count");
>>>>     conf.setNumMapTasks(10);
>>>>     job.setJarByClass(Test.class);
>>>>     job.setMapperClass(TestMapper.class);
>>>>     job.setCombinerClass(IntSumReducer.class);
>>>>     job.setReducerClass(IntSumReducer.class);
>>>>     job.setOutputKeyClass(Text.class);
>>>>     job.setOutputValueClass(IntWritable.class);
>>>>     job.setInputFormatClass(CustomInputFormat.class);
>>>>     job.setOutputFormatClass(TextOutputFormat.class);
>>>>     FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
>>>>     FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
>>>>     System.exit(job.waitForCompletion(true) ? 0 : 1);
>>>>   }
>>>> 
>>>> I have been digging around in the hadoop source code and it looks like the
>>>> JobClient actually sets the mappers to the number of splits (hard
>>>> coded)...snippet from the JobClient class:
>>>> 
>>>> *****************************************************************************
>>>> /**
>>>>    * Internal method for submitting jobs to the system.
>>>>    * @param job the configuration to submit
>>>>    * @return a proxy object for the running job
>>>>    * @throws FileNotFoundException
>>>>    * @throws ClassNotFoundException
>>>>    * @throws InterruptedException
>>>>    * @throws IOException
>>>>    */
>>>>   public
>>>>   RunningJob submitJobInternal(JobConf job
>>>>                                ) throws FileNotFoundException,
>>>>                                         ClassNotFoundException,
>>>>                                         InterruptedException,
>>>>                                         IOException {
>>>>     /*
>>>>      * configure the command line options correctly on the submitting dfs
>>>>      */
>>>> 
>>>>     JobID jobId = jobSubmitClient.getNewJobId();
>>>>     Path submitJobDir = new Path(getSystemDir(), jobId.toString());
>>>>     Path submitJarFile = new Path(submitJobDir, "job.jar");
>>>>     Path submitSplitFile = new Path(submitJobDir, "job.split");
>>>>     configureCommandLineOptions(job, submitJobDir, submitJarFile);
>>>>     Path submitJobFile = new Path(submitJobDir, "job.xml");
>>>>     int reduces = job.getNumReduceTasks();
>>>>     JobContext context = new JobContext(job, jobId);
>>>> 
>>>>     // Check the output specification
>>>>     if (reduces == 0 ? job.getUseNewMapper() : job.getUseNewReducer()) {
>>>>       org.apache.hadoop.mapreduce.OutputFormat<?,?> output =
>>>>         ReflectionUtils.newInstance(context.getOutputFormatClass(), job);
>>>>       output.checkOutputSpecs(context);
>>>>     } else {
>>>>       job.getOutputFormat().checkOutputSpecs(fs, job);
>>>>     }
>>>> 
>>>>     // Create the splits for the job
>>>>     LOG.debug("Creating splits at " + fs.makeQualified(submitSplitFile));
>>>>     int maps;
>>>>     if (job.getUseNewMapper()) {
>>>>       maps = writeNewSplits(context, submitSplitFile);
>>>>     } else {
>>>>       maps = writeOldSplits(job, submitSplitFile);
>>>>     }
>>>>     job.set("mapred.job.split.file", submitSplitFile.toString());
>>>>     job.setNumMapTasks(maps);
>>>> 
>>>>     // Write job file to JobTracker's fs
>>>>     FSDataOutputStream out =
>>>>       FileSystem.create(fs, submitJobFile,
>>>>                         new FsPermission(JOB_FILE_PERMISSION));
>>>> 
>>>>     try {
>>>>       job.writeXml(out);
>>>>     } finally {
>>>>       out.close();
>>>>    .....
>>>> 
>>>> }
>>>> 
>>>> 
>>>> 
>>>> *****************************************************************************
>>>> 
>>>> Is there anything I can do to get the number of mappers to be more
>>>> flexible?
>>>> 
>>>> 
>>>> Cheers,
>>>> 
>>>> Teryl
>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Best Regards
>>>> 
>>>> Jeff Zhang
>>>> 
>> 
>> 

