Subject: Re: CPU utilization
From: Jakub Stransky <stransky.ja@gmail.com>
To: user@hadoop.apache.org
Date: Fri, 12 Sep 2014 21:31:08 +0200

Adam, how did you come to the conclusion that it is memory bound? I haven't
found any such sign: even though the map tasks were assigned 768 MB, the job
counters reported only around 600 MB used, and GC time was not significant.

To be more specific about the job: in essence it loads data out of Kafka
messaging in protocol buffers format, deserializes it, and remaps it to the
Avro data format. That is done on a per-record basis, except for the Kafka
reader, which performs bulk reads via a buffer. Increasing the buffer size
and the fetch size didn't have any significant impact.

Maybe a completely silly question: how do I recognize that I have a
memory-bound job? Having a ~600 MB heap and roughly 30 s of GC time out of a
60 min job doesn't look to me like a sign of insufficient memory.

I don't see any apparent bound except the one I mentioned: CPU per task
process, as seen via the top command.
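In case I am reading the wrong numbers, this is roughly how the relevant task
counters can be pulled for a finished job (an untested sketch against the
standard MapReduce client API; the class name and the job id argument are
just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobID;
import org.apache.hadoop.mapreduce.TaskCounter;

public class JobMemoryCheck {
    public static void main(String[] args) throws Exception {
        // args[0] is the job id, e.g. "job_1410550268_0001" (placeholder)
        Cluster cluster = new Cluster(new Configuration());
        Job job = cluster.getJob(JobID.forName(args[0]));
        Counters counters = job.getCounters();

        long gcMs      = counters.findCounter(TaskCounter.GC_TIME_MILLIS).getValue();
        long physBytes = counters.findCounter(TaskCounter.PHYSICAL_MEMORY_BYTES).getValue();
        long heapBytes = counters.findCounter(TaskCounter.COMMITTED_HEAP_BYTES).getValue();

        // GC eating a large share of task time would point at memory pressure;
        // ~30 s of GC over a 60 min job does not.
        System.out.printf("GC: %d ms, physical: %d MB, committed heap: %d MB%n",
                gcMs, physBytes >> 20, heapBytes >> 20);
    }
}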
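And just to double-check that I follow the container arithmetic from your
reply quoted below (a throwaway snippet; the values are the ones you list,
the class name is made up):

public class ContainerMath {
    public static void main(String[] args) {
        int nodeManagerMb = 2048; // yarn.nodemanager.resource.memory-mb
        int mapMb         = 768;  // mapreduce.map.memory.mb
        int amMb          = 1024; // yarn.app.mapreduce.am.resource.mb

        // 2048 / 768 = 2: a third map container (3 x 768 = 2304 MB) no longer fits
        System.out.println("map containers per node:       " + nodeManagerMb / mapMb);

        // if I understand correctly, the node hosting the MR app master only has
        // (2048 - 1024) / 768 = 1 slot left for a map container
        System.out.println("map containers next to the AM: " + (nodeManagerMb - amMb) / mapMb);
    }
}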
On 12 September 2014 20:57, Adam Kawa <kawa.adam@gmail.com> wrote:

> Your NodeManager can use 2048 MB (yarn.nodemanager.resource.memory-mb) for
> allocating containers.
>
> If you run a map task, you need 768 MB (mapreduce.map.memory.mb).
> If you run a reduce task, you need 1024 MB (mapreduce.reduce.memory.mb).
> If you run the MapReduce app master, you need 1024 MB
> (yarn.app.mapreduce.am.resource.mb).
>
> Therefore, when you run a MapReduce job, you can run only 2 containers per
> NodeManager (3 x 768 = 2304 > 2048) on your setup.
>
> 2014-09-12 20:37 GMT+02:00 Jakub Stransky <stransky.ja@gmail.com>:
>
>> I thought that the memory assigned has to be a multiple of
>> yarn.scheduler.minimum-allocation-mb and is rounded accordingly.
>
> That's right. It also specifies the minimum size of a container, to prevent
> requests for unreasonably small containers (which are likely to cause task
> failures).
>
>> Any other I am not aware of? Are there any additional parameters like the
>> ones you mentioned that should be set?
>
> There are also settings related to vcores in mapred-site.xml and
> yarn-site.xml. But they don't change anything in your case (as you are
> limited by memory, not vcores).
>
>> The job wasn't the smallest, but it wasn't PBs of data either. It was run
>> on 1.5 GB of data and ran for 60 min. I wasn't able to make any significant
>> improvement. It is a map-only job, and I wasn't able to achieve more than
>> 30% of total machine CPU utilization. However, top was displaying 100% CPU
>> for the process running on the data node; that's why I was thinking about a
>> limit on the container process. I didn't find any other boundary such as
>> IO, network, or memory.
>
> CPU utilization depends on the type of your jobs (e.g. doing complex math
> operations or just counting words) and the number of containers you run. If
> you want to play with this, you can run more CPU-bound jobs or increase the
> number of containers running on a node.

-- 
Jakub Stransky
cz.linkedin.com/in/jakubstransky