From: Mohammad Tariq <dontariq@gmail.com>
Date: Sat, 11 May 2013 23:07:38 +0530
Subject: Re: Need help about task slots
To: user@hadoop.apache.org
Hello guys,

My 2 cents:

The number of mappers is primarily governed by the number of InputSplits created by the InputFormat you are using, and the number of reducers by the number of partitions you get after the map phase. Having said that, you should also keep in mind the number of slots available per slave, along with the available memory. As a general rule, you could use this approach:

Take the number of virtual CPUs * 0.75; that's the number of slots you can configure. For example, if you have 12 physical cores (or 24 virtual cores), you would have 24 * 0.75 = 18 slots. Based on your requirements, you can then choose how to split them between mappers and reducers. With 18 MR slots, you could have 9 mappers and 9 reducers, 12 mappers and 6 reducers, or whatever split works for you.
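Tariq's rule of thumb above is simple arithmetic; a minimal sketch of it (the 0.75 factor and the even split are his suggestions, not Hadoop defaults):

```python
def total_slots(virtual_cpus, factor=0.75):
    """Rule-of-thumb slot count: reserve ~25% of the virtual cores
    for the OS and the DataNode/TaskTracker daemons."""
    return int(virtual_cpus * factor)

def split_slots(slots, map_fraction=0.5):
    """Divide the total slot budget into (map_slots, reduce_slots)."""
    map_slots = round(slots * map_fraction)
    return map_slots, slots - map_slots

slots = total_slots(24)       # 12 physical cores, hyper-threaded
print(slots)                  # 18
print(split_slots(slots))     # (9, 9) with an even split
```

Skewing `map_fraction` toward maps (e.g. 2/3 gives a 12/6 split) is common when jobs are map-heavy.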

I don't know if it makes much sense, but it works pretty decently for me.


Warm Regards,
Tariq


On Sat, May 11, 2013 at 8:57 PM, Rahul Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:
Hi,

I am also new to the Hadoop world; here is my take on your question. If something is missing, others will surely correct it.

Pre-YARN (MRv1), the slots are fixed: they are computed from the crunching capacity of the datanode hardware, then divided into map and reduce slots, written into the config files, and they remain fixed until changed. In YARN, the allocation is decided at runtime based on the requirements of each task, so it is quite possible that at some point one datanode is running 10 tasks while another, similar datanode is running only 4.
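For reference, the fixed per-TaskTracker slot counts described above are set in MRv1's mapred-site.xml; the values below are illustrative (the shipped default for both is 2):

```xml
<!-- mapred-site.xml (MRv1): fixed slot counts per TaskTracker -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>12</value> <!-- illustrative value, not a default -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>6</value>  <!-- illustrative value, not a default -->
</property>
```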

Coming to your question: the number of map tasks is decided based on the data set size, the DFS block size, and the InputFormat. Generally, for file-based InputFormats it is one mapper per data block, though there are ways to change this through configuration settings. The number of reduce tasks is set in the job configuration.
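A sketch of the knobs involved, using the Hadoop 2.x property names (the 256 MB and 12 values are illustrative, not recommendations):

```xml
<!-- Per-job settings; values are illustrative -->
<property>
  <name>mapreduce.input.fileinputformat.split.minsize</name>
  <value>268435456</value> <!-- 256 MB: a minimum split size above the
                                block size yields fewer, larger mappers -->
</property>
<property>
  <name>mapreduce.job.reduces</name>
  <value>12</value> <!-- reduce task count comes from the job configuration -->
</property>
```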

A general rule I have read in various documents is that mappers should run for at least a minute, so you can run a sample job to find a data block size that makes your mappers run longer than that. It also depends on your SLA: if you are not chasing a very tight SLA, you can choose to run fewer mappers at the expense of a higher runtime.
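That sizing rule reduces to simple arithmetic; a sketch, where the throughput figure is a made-up measurement you would replace with one from your own sample run:

```python
def min_block_size_mb(mapper_throughput_mb_s, min_runtime_s=60):
    """Smallest block size (MB) that keeps a mapper busy for at least
    min_runtime_s, given the per-mapper throughput measured in a sample run."""
    return mapper_throughput_mb_s * min_runtime_s

# Hypothetical: a sample run shows each mapper processes ~4 MB/s.
print(min_block_size_mb(4))   # 240 -> a 256 MB block keeps mappers busy past a minute
```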

But again, this is all theory; I am not sure how these things are handled in actual prod clusters.

HTH,



Thanks,
Rahul


On Sat, May 11, 2013 at 8:02 PM, Shashidhar Rao <raoshashidhar123@gmail.com> wrote:
Hi Users,

I am new to Hadoop and confused about task slots in a cluster. How would I know how many task slots would be required for a job? Is there any empirical formula, or on what basis should I set the number of task slots?

Thanks in advance

