Subject: Re: Performance insights
From: Stephan Ewen <ewenstephan@gmail.com>
To: user@flink.apache.org
Date: Fri, 5 Feb 2016 16:09:06 +0100

Yes, that is definitely one possible explanation.

Another one could be data skew: increased parallelism does not take work off the most overloaded partition (but it does reduce the memory available to that partition). The web dashboard should help you check that.

On Fri, Feb 5, 2016 at 3:34 PM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
> Sorry, I forgot to say that the numberOfTaskSlots is always 6.
>
> On Fri, Feb 5, 2016 at 3:32 PM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
>> Hi to all,
>>
>> I'm testing how to speed up my Flink job, and I ran into the following
>> situations on my 6-node cluster (each node has 8 CPUs; one node also
>> runs the job manager):
>>
>> Scenario 1:
>> - # of network buffers: 4096
>> - parallelism: 36
>> - The job fails because there are not enough network buffers
>>
>> Scenario 2:
>> - # of network buffers: 8192
>> - parallelism: 36
>> - The job completes successfully in about 20 minutes
>>
>> Scenario 3:
>> - # of network buffers: 4096
>> - 6 nodes
>> - parallelism: 6
>> - The job completes successfully in about 11 minutes
>>
>> What can I infer from these results? That my job is I/O-bound, so having
>> more threads on the same machine accessing the disk simultaneously
>> degrades the performance of the pipeline?
>>
>> Best,
>> Flavio
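One way to check the data-skew hypothesis, besides the web dashboard, is a small diagnostic job that counts records per grouping key: if a few keys dominate, raising the parallelism cannot spread their work across more slots. A minimal sketch against the DataSet API (the input path and the one-key-per-line format are placeholder assumptions, not part of the original job):

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class SkewCheck {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Placeholder input: one grouping key per line, extracted from
            // the real job's data.
            DataSet<String> keys = env.readTextFile("hdfs:///path/to/keys");

            // Count records per key. A handful of very large counts means the
            // work is concentrated on a few partitions, and raising the
            // parallelism will not spread that work out.
            DataSet<Tuple2<String, Long>> counts = keys
                    .map(new MapFunction<String, Tuple2<String, Long>>() {
                        @Override
                        public Tuple2<String, Long> map(String key) {
                            return new Tuple2<>(key, 1L);
                        }
                    })
                    .groupBy(0)
                    .sum(1);

            counts.print();
        }
    }

If the top few counts dwarf the rest, the 11-minute Scenario 3 run is probably limited by those overloaded partitions rather than by the total parallelism.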
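On the network-buffer side, the number of buffers a TaskManager needs grows roughly with the square of the number of slots it runs, because each slot may exchange data with every parallel sender. The Flink documentation from this period suggests roughly #slots-per-TM^2 * #TMs * 4 as a starting point for taskmanager.network.numberOfBuffers; Scenario 1 shows a shuffle-heavy job can need considerably more, so treat the formula as a lower bound. A sketch of the relevant flink-conf.yaml entries, using the Scenario 2 values purely for illustration:

    # flink-conf.yaml -- values taken from Scenario 2, for illustration only
    taskmanager.numberOfTaskSlots: 6
    # Total network buffers per TaskManager; Scenario 1 failed with 4096
    # at parallelism 36, while Scenario 2 succeeded with 8192.
    taskmanager.network.numberOfBuffers: 8192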