hadoop-mapreduce-user mailing list archives

From Sugandha Naolekar <sugandha....@gmail.com>
Subject Re: Mappers vs. Map tasks
Date Tue, 25 Feb 2014 10:24:23 GMT
Hi Bertrand,

As you said, no. of HDFS blocks = no. of input splits. But this is only
true when you set isSplitable() to false or when your input file size is
less than the block size. Also, when it comes to text files, the default
TextInputFormat considers each line as one input split, which can then be
read by the RecordReader in key/value (K, V) form.
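
For reference, here is a minimal sketch of how the split count appears to be derived by FileInputFormat in the new (org.apache.hadoop.mapreduce) API, written in plain Java with no Hadoop dependency. The class name SplitCountSketch and the method countSplits are hypothetical, and non-empty files are assumed. The split-size rule max(minSize, min(maxSize, blockSize)) and the 1.1 slop factor follow FileInputFormat as I understand it. Under this logic a non-splittable file yields a single split however many blocks it spans, and with TextInputFormat a line is a record inside a split rather than a split of its own.

```java
public class SplitCountSketch {
    // Slop factor: the last chunk may be up to 10% larger than splitSize
    // before it is cut into a separate split.
    static final double SPLIT_SLOP = 1.1;

    // Split-size rule: max(minSize, min(maxSize, blockSize)).
    public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Hypothetical helper: number of input splits for one non-empty file.
    public static int countSplits(long fileSize, long splitSize, boolean splitable) {
        if (fileSize <= 0 || splitSize <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        if (!splitable) {
            return 1; // the whole file becomes one split
        }
        int splits = 0;
        long remaining = fileSize;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // trailing partial split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(64 * mb, 1L, Long.MAX_VALUE);
        System.out.println(countSplits(10 * mb, splitSize, true));   // 1
        System.out.println(countSplits(200 * mb, splitSize, true));  // 4
        System.out.println(countSplits(200 * mb, splitSize, false)); // 1
    }
}
```

With default min/max sizes the split size equals the block size, which is why the number of splits usually matches the number of blocks for splittable files.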

Please correct me if I don't make sense.

--
Thanks & Regards,
Sugandha Naolekar





On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <dechouxb@gmail.com> wrote:

> The wiki and Hadoop: The Definitive Guide are good resources.
>
> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>
> Mapper is the name of the abstract class/interface. It does not really
> make sense to talk about the number of mappers.
> A task is a JVM that can be launched only if there is a free slot, i.e. for a
> given slot, at a given time, there will be at most a single task.
> During the task, the configured Mapper will be instantiated.
>
> Always:
> Number of input splits = no. of map tasks
>
> And generally:
> Number of HDFS blocks = number of input splits
>
> Regards
>
> Bertrand
>
> PS: I don't know if it is only my client, but please avoid red text when
> writing an email.
>
> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <drdwitte@gmail.com> wrote:
>
>> Each node has a tasktracker with a number of map slots. A map slot hosts
>> a mapper. A mapper executes map tasks. If there are more map tasks than
>> slots, there will obviously be multiple rounds of mapping.
>>
>> The map function is called once for each input record. A block is
>> typically 64MB and can contain a multitude of records; therefore a map task
>> = running the map() function on all records in the block.
>>
>> Number of blocks = no. of map tasks (not mappers)
>>
>> Furthermore you have to make a distinction between the two layers. You
>> have a layer for computations which consists of a jobtracker and a set of
>> tasktrackers. The other layer is responsible for storage. The HDFS has a
>> namenode and a set of datanodes.
>>
>> In MapReduce the code is executed where the data is. So if a block is on
>> datanodes 1, 2 and 3, then the map task associated with this block will
>> likely be executed on one of those physical nodes, by tasktracker 1, 2 or
>> 3. But this is not required; things can be rearranged.
>>
>> Hopefully this gives you a little more insight.
>>
>> Regards, Dieter
>>
>>
>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <sugandha.n87@gmail.com>:
>>
>>> One more thing to ask: No. of blocks = no. of mappers. Thus, the map()
>>> function will be called that many times, right?
>>>
>>> --
>>> Thanks & Regards,
>>> Sugandha Naolekar
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>> sugandha.n87@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> As per the various articles I have gone through to date, the file(s) are
>>>> split into chunks/blocks. On the same note, I would like to ask a few things:
>>>>
>>>>
>>>>    1. The no. of mappers is decided as: Total_File_Size / Max. Block Size.
>>>>    Thus, if the file is smaller than the block size, only one mapper will
>>>>    be invoked. Right?
>>>>    2. If yes, it means the map() will be called only once. Right? In
>>>>    this case, if there are two datanodes with a replication factor of 1,
>>>>    only one datanode (mapper machine) will perform the task. Right?
>>>>    3. The map() function is called by all the datanodes/slaves, right?
>>>>    If the no. of mappers is more than the no. of slaves, what happens?
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Sugandha Naolekar
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
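
Two points from the quoted thread, that map() runs once per record within a task and that tasks beyond the available slots run in later rounds, can be sketched in plain Java. This is only an illustration with no Hadoop dependency: the class MapTaskSketch and its methods are hypothetical, and the rounds calculation assumes an idealized cluster where every task takes the same time and every slot stays available.

```java
import java.util.Arrays;
import java.util.List;

public class MapTaskSketch {
    // Idealized number of rounds: ceil(mapTasks / totalMapSlots).
    public static int rounds(int mapTasks, int totalMapSlots) {
        if (mapTasks < 0 || totalMapSlots <= 0) {
            throw new IllegalArgumentException("need mapTasks >= 0 and slots > 0");
        }
        return (mapTasks + totalMapSlots - 1) / totalMapSlots;
    }

    // Stand-in for the user's map() logic; does nothing here.
    static void map(String record) {
    }

    // One map task: invoke map() once for every record in its input split
    // and report how many calls were made.
    public static int runMapTask(List<String> recordsInSplit) {
        int calls = 0;
        for (String record : recordsInSplit) {
            map(record);
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) {
        // 10 map tasks on a cluster with 4 map slots in total.
        System.out.println(rounds(10, 4)); // 3
        // A split holding three lines leads to three map() calls in one task.
        System.out.println(runMapTask(Arrays.asList("line 1", "line 2", "line 3"))); // 3
    }
}
```

So a file smaller than one block gives one map task, but map() is still called once for each record in that file, not once overall.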
