Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of himanshuvj@gmail.com
 designates 74.125.82.179 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CACBYxK+tD2chEH83pwzUT07dt5DVKXxN8dD-2N14cwwsQiMMAg@mail.gmail.com>
References: 
 <CAKG3H40M0v3WnPVz7yHCYHzytoyVC_bSqUzCBjcpGT36GX40PA@mail.gmail.com>
 <CACBYxK+tD2chEH83pwzUT07dt5DVKXxN8dD-2N14cwwsQiMMAg@mail.gmail.com>
From: Himanshu Vijay <himanshuvj@gmail.com>
Date: Tue, 1 Oct 2013 00:06:44 -0700
Message-ID: 
 <CAKG3H40+ApGewOfEB_e7zpfhBsQ6y5SiNJzO9jiU8FtV7o-YKg@mail.gmail.com>
Subject: Re: Cluster config: Mapper:Reducer Task Capapcity
To: user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=e89a8f643342a44c5604e7a89949

--e89a8f643342a44c5604e7a89949
Content-Type: text/plain; charset=ISO-8859-1

What is the downside of increasing both
mapred.tasktracker.map.tasks.maximum
and mapred.tasktracker.reduce.tasks.maximum to the same value?

I read on this link <http://developer.yahoo.com/hadoop/tutorial/module7.html> that:

  mapred.tasktracker.map.tasks.maximum
      1/2 * (cores/node) to 2 * (cores/node)
      Number of map tasks to deploy on each machine.

  mapred.tasktracker.reduce.tasks.maximum
      1/2 * (cores/node) to 2 * (cores/node)
      Number of reduce tasks to deploy on each machine.

Each node has 8 cores, so according to the above guidance I should set each
of the two configs somewhere between 4 and 16. The ratio of mappers to
reducers doesn't really matter as far as these two properties are concerned.
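
As a rough sketch of that arithmetic (the function names and the 10-slot
per-node budget are made up for illustration and are not Hadoop APIs; the
1/2x and 2x multipliers are the ones from the tutorial table above):

```python
# Sketch only: helper names are hypothetical, not part of Hadoop.

def slot_range(cores_per_node):
    """Return the (min, max) task slots per node suggested by the guideline:
    1/2 * cores up to 2 * cores."""
    return cores_per_node // 2, cores_per_node * 2


def split_slots(total_slots, map_to_reduce_ratio):
    """Split an assumed per-node slot budget into (map, reduce) slots
    so that map:reduce is close to the requested ratio."""
    reduce_slots = max(1, round(total_slots / (map_to_reduce_ratio + 1)))
    return total_slots - reduce_slots, reduce_slots


low, high = slot_range(8)       # 8 cores per node, as on our cluster
print(low, high)                # 4 16
print(split_slots(10, 4))       # (8, 2) -> a 4:1 mapper:reducer ratio
```

The two numbers from split_slots would then be the candidate values for
mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum on each node.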


On Mon, Sep 30, 2013 at 12:52 PM, Sandy Ryza <sandy.ryza@cloudera.com> wrote:

> Hi Himanshu,
>
> Changing the ratio is definitely a reasonable thing to do.  The capacities
> come from the mapred.tasktracker.map.tasks.maximum
> and mapred.tasktracker.reduce.tasks.maximum tasktracker configurations.
>  You can tweak these on your nodes to get your desired ratio.
>
> -Sandy
>
>
> On Mon, Sep 30, 2013 at 12:39 PM, Himanshu Vijay <himanshuvj@gmail.com> wrote:
>
>> Hi,
>>
>> Our Hadoop cluster is running 0.20.203. The cluster currently has a 'Map
>> Task Capacity' of 8900+ and a 'Reduce Task Capacity' of 3300+, resulting
>> in a ratio of 2.7. We run a wide variety of jobs and we want to
>> increase the throughput.
>>
>> My manual observation was that we hit the mapper capacity, so many
>> jobs have to wait even though there is a lot of room left in the reduce
>> capacity. I mined the jobtracker logs for the jobs that completed and saw
>> that, on an hourly as well as a daily basis, the mapper:reducer ratio
>> was 4-5.
>>
>> To increase the throughput, I was thinking of experimenting with the
>> Map and Reduce Task Capacity so that the ratio increases from 2.7 to
>> ~4.
>>
>> Does this sound like a correct approach? Is this something that I can
>> control, or is it determined automatically by Hadoop?
>>
>> Have any of you done this kind of exercise? If so, can you please point me
>> to how to go about changing this ratio? I am not finding much literature
>> on it.
>>
>> Note: Map and Reduce Task Capacity is the maximum total number of
>> mappers/reducers that can run on the cluster at any point.
>>
>> Regards,
>> -Himanshu Vijay
>>
>
>


-- 
-Himanshu Vijay

--e89a8f643342a44c5604e7a89949--