hadoop-common-user mailing list archives

From Arun C Murthy <...@yahoo-inc.com>
Subject Re: setting # of maps for a job
Date Wed, 23 Jan 2008 14:27:35 GMT

On Jan 23, 2008, at 5:55 AM, Khalil Honsali wrote:

> thanks,
>
> from the API:
> "Thus, if you expect 10TB of input data and have a blocksize of 128MB,
> you'll end up with 82,000 maps, unless
> setNumMapTasks(int)<http://lucene.apache.org/hadoop/docs/r0.15.2/ 
> api/org/apache/hadoop/mapred/JobConf.html#setNumMapTasks%28int%29>is
> used to set it even higher."
>
> then, setNumMapTasks does not explicitly tell hadoop how many map tasks
> should be run in parallel on each node?
>

Uh. setNumMapTasks is a hint to the Map-Reduce framework for the total
number of maps for a given job (i.e. a given input data-set). Is that
what you are asking?
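
For illustration, here is a minimal sketch of passing that hint through
the old mapred API (the driver class and job name are hypothetical, and
input/output paths, mapper and reducer are omitted):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SplitHintDemo {                     // hypothetical driver class
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SplitHintDemo.class);
    conf.setJobName("split-hint-demo");          // hypothetical job name
    // From the javadoc example quoted above:
    // 10TB / 128MB = (10 * 1024 * 1024) / 128 = 81,920 (~82,000) splits,
    // so the framework creates that many maps regardless of a lower hint.
    conf.setNumMapTasks(82000);                  // a hint, not a guarantee
    // ... set input/output paths, mapper, reducer, etc. here ...
    JobClient.runJob(conf);
  }
}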

If not, are you trying to explicitly control the number of maps run
simultaneously on a given TaskTracker? There is a per-tracker setting,
'mapred.tasktracker.tasks.maximum' (which has been split into
mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum in 0.16.0). All of these are
job-agnostic settings, i.e. the TaskTracker doesn't care which job's
map/reduce tasks it is running.
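
For example, a sketch of the 0.16.0 form of those settings in the
TaskTracker's hadoop-site.xml (the values are illustrative, and each
TaskTracker reads them at startup, so it has to be restarted to pick
them up):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>   <!-- illustrative: max concurrent maps per tracker -->
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>   <!-- illustrative: max concurrent reduces per tracker -->
</property>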

If you are trying to limit a given job's parallelism on a given node,  
we don't have that feature.

hth,
Arun

>
>
> On 23/01/2008, Arun C Murthy <acm@yahoo-inc.com> wrote:
>>
>>
>> On Jan 22, 2008, at 10:22 PM, Khalil Honsali wrote:
>>
>>> Hi,
>>>
>>> I am experiencing a similar problem, even after varying [blocksize],
>>> [splitsize] and [num map tasks] in both the API and hadoop-site.xml;
>>> the number of map tasks was 8 instead of the expected 20 on a 4-node
>>> cluster.
>>>
>>> I am working with text files. There is an issue about this where the
>>> suggested solution is to zip the files so that a single zip file is
>>> much larger than a block:
>>> http://www.mail-archive.com/hadoop-user@lucene.apache.org/msg02836.html
>>>
>>> However, I still don't understand two issues:
>>> - what is the relationship between the number of files, file size,
>>> block size, split size and the number of map tasks?
>>
>> http://hadoop.apache.org/core/docs/r0.15.2/mapred_tutorial.html#How+Many+Maps?
>> and the javadoc for JobConf.setNumMapTasks:
>> http://lucene.apache.org/hadoop/docs/r0.15.2/api/org/apache/hadoop/mapred/JobConf.html#setNumMapTasks(int)
>>
>> hth,
>> Arun
>>
>>> - what if I wanted to serve the text files directly from HDFS over
>>> HTTP? I don't want to zip and unzip them each time, right? How do I
>>> configure hadoop so that it works best with small files directly
>>> (maybe it is not designed for that?)
>>>
>>> Finally, I wonder if it would be useful to have a tool for estimating
>>> optimum performance based on the workload parameters, instead of
>>> manual trial and error.
>>>
>>>
>>> thanks very much!
>>>
>>>
>>> On 23/01/2008, Ted Dunning <tdunning@veoh.com> wrote:
>>>>
>>>>
>>>>
>>>> Setting the number of maps lower than would otherwise be used is
>>>> useful if you have a job that should not clog up the cluster. If you
>>>> don't need it to run quickly, then you can set m = N / 5 or so and
>>>> get slow progress with small impact on the throughput of the cluster.
>>>>
>>>> If and when HADOOP-2573 gets resolved, there will be a much better
>>>> answer for this.
>>>>
>>>>
>>>> On 1/22/08 8:01 PM, "Amar Kamat" <amarrk@yahoo-inc.com> wrote:
>>>>
>>>>> Hi,
>>>>> You can't directly control the number of maps. It's based on the
>>>>> splits of the data residing on the DFS. The number one provides via
>>>>> the command line/code/the conf files is a hint to Hadoop. I guess
>>>>> the reason is that if the #maps (set externally) were less than the
>>>>> #splits, we might end up migrating the data, which is a performance
>>>>> hit. There could be other reasons too.
>>>>> Amar
>>>>> Stefan Groschupf wrote:
>>>>>> Hi,
>>>>>> I have trouble setting the number of maps for a job with version
>>>>>> 0.15.1.
>>>>>> As far as I understand, I can configure the number of maps that a
>>>>>> job will do in a hadoop-site.xml on the box where I submit the job
>>>>>> (that is not the jobtracker box).
>>>>>> However, my configuration is always ignored. Changing the value in
>>>>>> the hadoop-site on the jobtracker box and restarting the nodes does
>>>>>> not help either.
>>>>>> I also do not set the number via the API.
>>>>>> Any ideas where I might be overlooking something?
>>>>>> Thanks for any hints,
>>>>>> Stefan
>>>>>>
>>>>>>
>>>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>>> 101tec Inc.
>>>>>> Menlo Park, California, USA
>>>>>> http://www.101tec.com
>>>>>>
>>>>>>

