From: Mohammad Tariq <dontariq@gmail.com>
Date: Wed, 26 Feb 2014 02:09:10 +0530
Subject: Re: Mappers vs. Map tasks
To: "hdfs-user@hadoop.apache.org"

Hi Sugandha,

Please find my comments embedded below:

> No. of mappers are decided as: Total_File_Size / Max. Block Size. Thus, if
> the file is smaller than the block size, only one mapper will be invoked.
> Right?

This is true (but not always). The basic criterion behind map creation is the
logic inside the *getSplits* method of the *InputFormat* being used in your MR
job. It is the behavior of *file based InputFormats*, typically sub-classes of
*FileInputFormat*, to split the input data into splits based on the total size,
in bytes, of the input files. See *this* for more details. And yes, if the file
is smaller than the block size then only 1 mapper will be created.
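To make that concrete, here is a small, self-contained Java sketch of the
arithmetic a FileInputFormat-style getSplits() performs. It is illustrative
only, not the actual Hadoop source: the class and method names below are made
up, and details such as the split-slop factor and per-split host locations are
ignored.

// Simplified sketch of how a FileInputFormat-style input format turns one
// file into input splits. For the real logic see
// org.apache.hadoop.mapreduce.lib.input.FileInputFormat#getSplits.
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {

    // The split size is the block size clamped between the configured
    // minimum and maximum split sizes.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns the byte lengths of the splits produced for a single file.
    static List<Long> splitLengths(long fileLength, long blockSize,
                                   long minSize, long maxSize) {
        List<Long> splits = new ArrayList<>();
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        long remaining = fileLength;
        while (remaining > splitSize) {
            splits.add(splitSize);
            remaining -= splitSize;
        }
        if (remaining > 0) {
            splits.add(remaining);   // last, possibly smaller, split
        }
        return splits;
    }

    public static void main(String[] args) {
        long block = 64L * 1024 * 1024;                       // 64 MB blocks
        // 10 MB file: smaller than one block -> one split -> one map task.
        System.out.println(splitLengths(10L * 1024 * 1024, block, 1L, Long.MAX_VALUE));
        // 200 MB file: spans four blocks -> four splits -> four map tasks.
        System.out.println(splitLengths(200L * 1024 * 1024, block, 1L, Long.MAX_VALUE));
    }
}

With a 64 MB block size this prints one split for the 10 MB file and four
splits (64 + 64 + 64 + 8 MB) for the 200 MB file, which is exactly the number
of map tasks the job would run.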
> If yes, it means, the map() will be called only once. Right? In this case,
> if there are two datanodes with a replication factor as 1: only one
> datanode (mapper machine) will perform the task. Right?

A mapper is called for each split. Don't get confused between MR's split and
HDFS's block. Both are different (they may overlap though, as in the case of
FileInputFormat). HDFS blocks are a physical partitioning of your data, while
an InputSplit is just a logical partitioning. If you have a file which is
smaller than the HDFS block size then only one split will be created, hence
only 1 mapper will be called. And this will happen on the node where this file
resides.

> The map() function is called by all the datanodes/slaves right? If the no.
> of mappers are more than the no. of slaves, what happens?

map() doesn't get called by everybody. Rather, a map task gets created on the
node where the chunk of data to be processed resides. A slave node can run
multiple mappers based on the availability of CPU slots.

> One more thing to ask: No. of blocks = no. of mappers. Thus, those many no.
> of times the map() function will be called right?

No. of blocks = no. of splits = no. of mappers. A map is called only once per
split, on a node where that split's data is present.

HTH

Warm Regards,
Tariq
cloudfront.blogspot.com
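One small addition, as a sketch rather than anything authoritative: assuming
the standard org.apache.hadoop.mapreduce API is on the classpath, the made-up
mapper below shows where the per-split and per-record behavior meet. One
instance of the class is created for each map task (i.e. for each input
split), and map() is then invoked once for every record in that split; with
TextInputFormat a record is one line, keyed by its byte offset.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical example mapper: counts the lines it sees. One instance per
// map task; map() runs once per record of the task's split.
public class LineCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final Text LINES = new Text("lines");
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Each call handles exactly one record: the key is the byte offset
        // of the line within the file, the value is the line's text.
        context.write(LINES, ONE);
    }
}

So the quantities discussed in this thread line up as: HDFS blocks (physical)
-> input splits (logical) -> map tasks (one per split) -> map() calls (one per
record within the split).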
On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <sugandha.n87@gmail.com> wrote:

> Hi Bertrand,
>
> As you said, no. of HDFS blocks = no. of input splits. But this is only true
> when you set isSplittable() as false or when your input file size is less
> than the block size. Also, when it comes to text files, the default
> TextInputFormat considers each line as one input split which can then be
> read by the RecordReader in K,V format.
>
> Please correct me if I don't make sense.
>
> --
> Thanks & Regards,
> Sugandha Naolekar
>
>
> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <dechouxb@gmail.com> wrote:
>
>> The wiki (or Hadoop: The Definitive Guide) are good resources.
>>
>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>
>> Mapper is the name of the abstract class/interface. It does not really make
>> sense to talk about the number of mappers.
>> A task is a JVM that can be launched only if there is a free slot, i.e. for
>> a given slot, at a given time, there will be at most a single task. During
>> the task, the configured Mapper will be instantiated.
>>
>> Always:
>> number of input splits = no. of map tasks
>>
>> And generally:
>> number of HDFS blocks = number of input splits
>>
>> Regards
>>
>> Bertrand
>>
>> PS: I don't know if it is only my client, but avoid red when writing a
>> mail.
>>
>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <drdwitte@gmail.com> wrote:
>>
>>> Each node has a tasktracker with a number of map slots. A map slot hosts a
>>> mapper. A mapper executes map tasks. If there are more map tasks than
>>> slots, obviously there will be multiple rounds of mapping.
>>>
>>> The map function is called once for each input record. A block is
>>> typically 64 MB and can contain a multitude of records, therefore a map
>>> task = run the map() function on all records in the block.
>>>
>>> Number of blocks = no. of map tasks (not mappers)
>>>
>>> Furthermore you have to make a distinction between the two layers. You
>>> have a layer for computations which consists of a jobtracker and a set of
>>> tasktrackers. The other layer is responsible for storage. HDFS has a
>>> namenode and a set of datanodes.
>>>
>>> In MapReduce the code is executed where the data is. So if a block is on
>>> datanodes 1, 2 and 3, then the map task associated with this block will
>>> likely be executed on one of those physical nodes, by tasktracker 1, 2 or
>>> 3. But this is not necessary; things can be rearranged.
>>>
>>> Hopefully this gives you a little more insight.
>>>
>>> Regards, Dieter
>>>
>>>
>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <sugandha.n87@gmail.com>:
>>>
>>>> One more thing to ask: No. of blocks = no. of mappers. Thus, those many
>>>> no. of times the map() function will be called right?
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Sugandha Naolekar
>>>>
>>>>
>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar
>>>> <sugandha.n87@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> As per the various articles I went through till date, the file(s) are
>>>>> split in chunks/blocks. On the same note, I would like to ask a few
>>>>> things:
>>>>>
>>>>> 1. No. of mappers are decided as: Total_File_Size / Max. Block Size.
>>>>>    Thus, if the file is smaller than the block size, only one mapper
>>>>>    will be invoked. Right?
>>>>> 2. If yes, it means, the map() will be called only once. Right? In this
>>>>>    case, if there are two datanodes with a replication factor as 1: only
>>>>>    one datanode (mapper machine) will perform the task. Right?
>>>>> 3. The map() function is called by all the datanodes/slaves right? If
>>>>>    the no. of mappers are more than the no. of slaves, what happens?
>>>>>
>>>>> --
>>>>> Thanks & Regards,
>>>>> Sugandha Naolekar