Subject: Re: Need help about task slots
From: Rahul Bhattacharjee <rahul.rec.dgp@gmail.com>
To: user@hadoop.apache.org
Date: Sun, 12 May 2013 18:03:09 +0530

Oh! I thought distcp works on complete files rather than a mapper per data block. So I guess the parallelism would still be there if there are multiple files. Please correct me if anything here is wrong.
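For instance (the paths below are purely hypothetical), I would expect copying a whole local directory to still fan out, one map per file:

    bin/hadoop distcp file:///home/rahul/data/ hdfs://localhost:9000/user/rahul/data/

If /home/rahul/data holds, say, 20 files, distcp can schedule up to 20 copy maps (its -m option caps the count), since an individual file is never split across maps. The caveat is that a file:// source has to be readable from whichever nodes run the maps (e.g. an NFS mount); otherwise the copy cannot spread across the cluster anyway.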

Thanks,
Rahul

On Sun, May 12, 2013 at 5:39 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
@Rahul: I'm sorry, I am not aware of any such document. But you could use distcp for a local to HDFS copy:

    bin/hadoop distcp file:///home/tariq/in.txt hdfs://localhost:9000/

And yes, when you use distcp from local to HDFS, you can't take the pleasure of parallelism, as the data is stored in a non-distributed fashion.

Warm Regards,
Tariq


On Sat, May 11, 2013 at 11:07 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
Hello guys,

My 2 cents:

Actually, the no. of mappers is primarily governed by the no. of InputSplits created by the InputFormat you are using, and the no. of reducers by the no. of partitions you get after the map phase. Having said that, you should also keep the no. of slots available per slave in mind, along with the available memory. But as a general rule you could use this approach:

Take the no. of virtual CPUs * 0.75, and that's the no. of slots you can configure. For example, if you have 12 physical cores (i.e. 24 virtual cores), you would have 24 * 0.75 = 18 slots. Now, based on your requirements, you can choose how many mappers and reducers you want. With 18 MR slots you could have 9 mappers and 9 reducers, or 12 mappers and 6 reducers, or whatever split works for you.

I don't know if it makes much sense, but it works pretty decently for me.
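In configuration terms, that 9/9 split would be a per-TaskTracker setting. A minimal sketch, assuming classic MRv1 (pre-YARN) property names, in mapred-site.xml:

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>9</value>   <!-- map slots on this node -->
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>9</value>   <!-- reduce slots on this node -->
    </property>

A TaskTracker then never runs more than 9 map tasks and 9 reduce tasks concurrently, whatever the job asks for.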


Warm Regards,
Tariq


On Sat, May 11, 2013 at 8:57 PM, Rahul Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:
Hi,

I am also new to the Hadoop world; here is my take on your question. If there is something missing, others will surely correct it.

Pre-YARN, the slots are fixed, computed from the crunching capacity of the datanode hardware. Once the slots per datanode are ascertained, they are divided into map and reduce slots; that goes into the config files and remains fixed until changed. In YARN, it is decided at runtime based on the requirements of the particular task. It is very much possible that at a certain point in time one datanode is running 10 tasks while another, similar datanode is running only 4.
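To make that concrete, here is a rough sketch of the YARN side (the values are made up): each NodeManager advertises a resource capacity, each task requests a share of it, and the number of concurrent tasks per node falls out at runtime instead of from fixed slots.

    <!-- yarn-site.xml: what one node offers to containers -->
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>24576</value>   <!-- 24 GB available for tasks -->
    </property>

    <!-- mapred-site.xml: what each task asks for -->
    <property>
      <name>mapreduce.map.memory.mb</name>
      <value>2048</value>
    </property>
    <property>
      <name>mapreduce.reduce.memory.mb</name>
      <value>4096</value>
    </property>

With these numbers one node could run up to 12 maps, or 6 reduces, or some mix, which is why two similar datanodes can be running very different task counts at the same moment.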

Coming to your question: the number of map tasks is decided based on the data set size, the DFS block size, and the InputFormat. Generally, for file-based InputFormats it is one mapper per data block, though there are ways to change this through configuration settings. The number of reduce tasks is set via the job configuration.

The general rule, as I have read in various documents, is that mappers should run for at least a minute, so you can run a sample to find a block (or split) size that makes your mappers run for more than a minute. It again depends on your SLA: if you are not chasing a very tight SLA, you can choose to run fewer mappers at the expense of a higher runtime.
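As a sketch of the knobs involved (the 256 MB and 512 MB figures are only illustrative; the helpers are the split-size setters from the new MapReduce API):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "split-tuning");

    // A split is normally one block; raising the minimum split size
    // coalesces blocks so each mapper gets more data and runs longer.
    FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024); // 256 MB
    FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024); // 512 MB

    // Reduce tasks come purely from the job configuration, not the data.
    job.setNumReduceTasks(9);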

But again, it is all theory; I am not sure how these things are handled in actual prod clusters.

HTH,



Thanks,
Rahul

On Sat, May 11, 2013 at 8:02 PM, Shashidhar Rao <raoshashidhar123@gmail.com> wrote:
Hi Users,

I am new to Hadoop and confused about task slots in a cluster. How would I know how many task slots are required for a job? Is there an empirical formula, or on what basis should I set the number of task slots?

Thanks in advance



