hadoop-mapreduce-user mailing list archives

From Bertrand Dechoux <decho...@gmail.com>
Subject Re: Mappers vs. Map tasks
Date Tue, 25 Feb 2014 08:37:55 GMT
The wiki and Hadoop: The Definitive Guide are good resources.
https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats

Mapper is the name of the abstract class/interface, so it does not really
make sense to talk about a number of mappers.
A task is a JVM that can be launched only if there is a free slot, i.e. for a
given slot, at a given time, there is at most one task.
During the task, the configured Mapper is instantiated.
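
To make the task/Mapper distinction concrete, here is a minimal Python sketch (not Hadoop code). The names Mapper, CountingMapper and run_map_task are made up for illustration; the setup/map/cleanup shape only imitates how a task drives its Mapper. The point is that one task creates one Mapper instance, and map() is then called once per record of the split, not once per task.

```python
class Mapper:
    """Illustrative stand-in for a Mapper base class (not the Hadoop API)."""

    def setup(self):
        pass

    def map(self, key, value, emit):
        # Identity behaviour by default.
        emit(key, value)

    def cleanup(self):
        pass

    def run(self, records, emit):
        # Called once per task: map() runs once for every record.
        self.setup()
        for key, value in records:
            self.map(key, value, emit)
        self.cleanup()


class CountingMapper(Mapper):
    """Hypothetical user Mapper that counts how often map() is invoked."""

    def __init__(self):
        self.map_calls = 0

    def map(self, key, value, emit):
        self.map_calls += 1
        emit(value, 1)


def run_map_task(mapper_class, split_records):
    """One map task: instantiate the configured Mapper once, feed it one split."""
    mapper = mapper_class()
    output = []
    mapper.run(split_records, lambda k, v: output.append((k, v)))
    return mapper, output
```

For a split holding three records, run_map_task(CountingMapper, records) creates a single CountingMapper whose map() is called three times.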

Always:
Number of input splits = number of map tasks

And generally (file-based input, default split size):
Number of HDFS blocks = number of input splits
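
As a rough illustration of why the second rule is only "generally" true, here is a Python sketch modeled on my understanding of how the new-API FileInputFormat sizes its splits: split size = max(minSize, min(maxSize, blockSize)), with a 10% slop on the last split. The function names and defaults are my own, empty files and non-splittable formats are not modeled, so treat it as an approximation rather than the actual implementation.

```python
SPLIT_SLOP = 1.1  # assumed: the last split may be up to 10% larger than split_size


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split size as max(min_size, min(max_size, block_size))."""
    return max(min_size, min(max_size, block_size))


def count_splits(file_size, block_size, min_size=1, max_size=2**63 - 1):
    """Approximate number of input splits (= map tasks) for one splittable file."""
    if file_size <= 0:
        raise ValueError("this sketch only models non-empty files")
    split_size = compute_split_size(block_size, min_size, max_size)
    remaining = file_size
    splits = 0
    # Carve off full-size splits while more than 1.1 split sizes remain.
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    # Whatever is left becomes the final split.
    if remaining > 0:
        splits += 1
    return splits
```

With 64 MB blocks, a 10 MB file gives 1 split and a 200 MB file gives 4 splits (one per block). A 68 MB file spans 2 blocks but gives only 1 split because of the slop, and raising the minimum split size to 128 MB turns the 200 MB file into 2 splits.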

Regards

Bertrand

PS: I don't know if it is only my client, but please avoid red text when
writing an email.

On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <drdwitte@gmail.com> wrote:

> Each node has a tasktracker with a number of map slots. A map slot hosts
> a mapper. A mapper executes map tasks. If there are more map tasks than
> slots, there will obviously be multiple rounds of mapping.
>
> The map function is called once for each input record. A block is
> typically 64MB and can contain a multitude of records, therefore a map task
> = run the map() function on all records in the block.
>
> Number of blocks = no. of map tasks (not mappers)
>
> Furthermore you have to make a distinction between the two layers. You
> have a layer for computations which consists of a jobtracker and a set of
> tasktrackers. The other layer is responsible for storage. The HDFS has a
> namenode and a set of datanodes.
>
> In MapReduce the code is executed where the data is. So if a block is on
> datanodes 1, 2 and 3, then the map task associated with this block will
> likely be executed on one of those physical nodes, by tasktracker 1, 2 or
> 3. But this is not required; things can be rearranged.
>
> Hopefully this gives you a little more insight.
>
> Regards, Dieter
>
>
> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <sugandha.n87@gmail.com>:
>
>> One more thing to ask: No. of blocks = no. of mappers. Thus, the map()
>> function will be called that many times, right?
>>
>> --
>> Thanks & Regards,
>> Sugandha Naolekar
>>
>>
>>
>>
>>
>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>> sugandha.n87@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> According to the various articles I have read so far, the file(s) are
>>> split into chunks/blocks. On the same note, I would like to ask a few things:
>>>
>>>
>>>    1. The no. of mappers is decided as: Total_File_Size/Max. Block Size.
>>>    Thus, if the file is smaller than the block size, only one mapper will be
>>>    invoked. Right?
>>>    2. If yes, it means the map() will be called only once. Right? In
>>>    this case, if there are two datanodes with a replication factor of 1, only
>>>    one datanode (mapper machine) will perform the task. Right?
>>>    3. The map() function is called by all the datanodes/slaves, right?
>>>    If the no. of mappers is more than the no. of slaves, what happens?
>>>
>>> --
>>> Thanks & Regards,
>>> Sugandha Naolekar
>>>
>>>
>>>
>>>
>>
>
