Subject: Re: Mappers vs. Map tasks
From: Dieter De Witte
To: user@hadoop.apache.org
Date: Tue, 25 Feb 2014 08:49:34 +0100

Each node has a tasktracker with a number of map slots. A map slot hosts a mapper, and a mapper executes map tasks. If there are more map tasks than slots, there will obviously be multiple rounds of mapping.

The map() function is called once for each input record. A block is typically 64 MB and can contain many records, so a map task runs map() on every record in its block.

Number of blocks = number of map tasks (not mappers).

Furthermore, you have to distinguish between the two layers. One layer handles computation and consists of a jobtracker and a set of tasktrackers. The other layer is responsible for storage: HDFS has a namenode and a set of datanodes.

In MapReduce the code is executed where the data is. So if a block is stored on datanodes 1, 2 and 3, then the map task associated with that block will likely be executed on one of those physical nodes, by tasktracker 1, 2 or 3. But this is not strictly necessary; things can be rearranged.
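To make the "one map task per block, one map() call per record" point concrete: a 200 MB file with a 64 MB block size is stored as ceil(200 / 64) = 4 blocks, so the job gets 4 map tasks, and inside each task the framework calls map() once per record of that block's split. Below is a minimal mapper sketch against the org.apache.hadoop.mapreduce API, assuming the default TextInputFormat (one record = one line of text); the class name and the word-counting logic are only illustrative, not something from this thread.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One map task is created per input split (normally one HDFS block).
// The framework then invokes map() once for every record in that split.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // 'offset' is the byte offset of this line in the file,
        // 'line' is the single record handed to us by the record reader.
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // one (word, 1) pair per token
        }
    }
}

So the number of map() invocations equals the number of records in the input, while the number of map tasks equals the number of splits/blocks; the tasktracker's map slots only bound how many of those tasks run at the same time.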
Hopefully this gives you a little more insight.

Regards,
Dieter


2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <sugandha.n87@gmail.com>:

> One more thing to ask: No. of blocks = no. of mappers. Thus, the map()
> function will be called that many times, right?
>
> --
> Thanks & Regards,
> Sugandha Naolekar
>
>
> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar
> <sugandha.n87@gmail.com> wrote:
>
>> Hello,
>>
>> As per the various articles I went through till date, the file(s) are
>> split into chunks/blocks. On the same note, I would like to ask a few
>> things:
>>
>> 1. The no. of mappers is decided as Total_File_Size / Max. Block Size.
>> Thus, if a file is smaller than the block size, only one mapper will be
>> invoked. Right?
>> 2. If yes, that means map() will be called only once. Right? In that
>> case, if there are two datanodes with a replication factor of 1, only
>> one datanode (the mapper machine) will perform the task. Right?
>> 3. Is the map() function called by all the datanodes/slaves? If the
>> no. of mappers is greater than the no. of slaves, what happens?
>>
>> --
>> Thanks & Regards,
>> Sugandha Naolekar