Date: Thu, 16 Feb 2017 05:24:43 -0800 (PST)
From: vinay patil
To: user@flink.apache.org
Subject: Re: Resource under-utilization when using RocksDb state backend [SOLVED]

Hi Cliff,

It would be really helpful if you could share your RocksDB configuration.

I am also running on c3.4xlarge EC2 instances backed by SSDs.

I had tried the FLASH_SSD_OPTIMIZED option, which works great, but somehow the
pipeline stalls partway through and the overall processing time increases. I
tried to set different values as mentioned in the video linked below, but I am
not getting it right; the TaskManagers get killed after some time.
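For reference, this is roughly how I am wiring the backend at the moment. It is
only a minimal sketch assuming the RocksDBStateBackend / OptionsFactory API from
the flink-statebackend-rocksdb module; the checkpoint URI, class name, and all
of the sizes below are placeholders rather than my real configuration:

import org.apache.flink.contrib.streaming.state.OptionsFactory;
import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;

public class RocksDbBackendSetup {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder checkpoint location.
        RocksDBStateBackend backend =
                new RocksDBStateBackend("hdfs:///flink/checkpoints");

        // Start from the SSD profile ...
        backend.setPredefinedOptions(PredefinedOptions.FLASH_SSD_OPTIMIZED);

        // ... then override a few knobs on top of it (example numbers only).
        backend.setOptions(new OptionsFactory() {

            @Override
            public DBOptions createDBOptions(DBOptions currentOptions) {
                // More background threads for flushes and compactions.
                return currentOptions.setIncreaseParallelism(4);
            }

            @Override
            public ColumnFamilyOptions createColumnOptions(ColumnFamilyOptions currentOptions) {
                // Larger write buffers and block cache = more memory for caching.
                return currentOptions
                        .setWriteBufferSize(64 * 1024 * 1024)
                        .setMaxWriteBufferNumber(4)
                        .setTableFormatConfig(
                                new BlockBasedTableConfig()
                                        .setBlockCacheSize(256 * 1024 * 1024));
            }
        });

        env.setStateBackend(backend);
        // ... rest of the pipeline ...
    }
}

My suspicion is that whatever extra memory RocksDB takes for its caches comes on
top of the TaskManager heap, so values that are too generous could push the
container over its YARN limit and get the TaskManager killed.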
Regards,
Vinay Patil

On Thu, Dec 8, 2016 at 10:19 PM, Cliff Resnick [via Apache Flink User Mailing
List archive.] wrote:

> It turns out that most of the time in RocksDBFoldingState was spent on
> serialization/deserialization. RocksDB read/write was performing well. By
> moving from Kryo to custom serialization we were able to increase
> throughput dramatically. Load is now where it should be.
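That is the direction I want to explore as well. Before writing a dedicated
serializer I first want to find every place where Kryo still kicks in. A minimal
sketch of that, assuming ExecutionConfig#disableGenericTypes is available in the
version we run (this is not Cliff's actual change):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FindKryoFallbacks {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // With generic types disabled, Flink throws an exception wherever a data
        // type would fall back to the generic (Kryo) serializer, so each fallback
        // can be located and replaced with a dedicated serializer one at a time.
        env.getConfig().disableGenericTypes();

        // ... define sources/operators here, then call env.execute(...) ...
    }
}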
>
> On Mon, Dec 5, 2016 at 1:15 PM, Robert Metzger <[hidden email]> wrote:
>
>> Another Flink user running RocksDB with large state on SSDs recently posted
>> this video on optimizing the performance of RocksDB on SSDs:
>> https://www.youtube.com/watch?v=pvUqbIeoPzM
>> That could be relevant for you.
>>
>> For how long did you look at iotop? It could be that the IO access happens
>> in bursts, depending on how data is cached.
>>
>> I'll also add Stefan Richter to the conversation; he may have some more
>> ideas about what we can do here.
>>
>> On Mon, Dec 5, 2016 at 6:19 PM, Cliff Resnick <[hidden email]> wrote:
>>
>>> Hi Robert,
>>>
>>> We're following 1.2-SNAPSHOT, using event time. I have tried "iotop"
>>> and I usually see less than 1% IO. The most I've seen was a quick flash
>>> here or there of something substantial (e.g. 19%, 52%), then back to
>>> nothing. I also assumed we were disk-bound, but to use your metaphor I'm
>>> having trouble finding any smoke. However, I'm not very experienced at
>>> sussing out IO issues, so perhaps there is something else I'm missing.
>>>
>>> I'll keep investigating. If I continue to come up empty, then I guess my
>>> next step may be to stage some independent tests directly against RocksDB.
>>>
>>> -Cliff
>>>
>>> On Mon, Dec 5, 2016 at 5:52 AM, Robert Metzger <[hidden email]> wrote:
>>>
>>>> Hi Cliff,
>>>>
>>>> Which Flink version are you using?
>>>> Are you using event-time or processing-time windows?
>>>>
>>>> I suspect that your disks are "burning" (= your job is IO bound). Can
>>>> you check with a tool like "iotop" how much disk IO Flink is producing?
>>>> Then I would set this number in relation to the theoretical maximum of
>>>> your SSDs (a good rough estimate is to use dd for that).
>>>>
>>>> If you find that your disk bandwidth is saturated by Flink, you could
>>>> look into tuning the RocksDB settings so that it uses more memory for
>>>> caching.
>>>>
>>>> Regards,
>>>> Robert
>>>>
>>>> On Fri, Dec 2, 2016 at 11:34 PM, Cliff Resnick <[hidden email]> wrote:
>>>>
>>>>> In tests comparing RocksDB to the fs state backend we observe much lower
>>>>> throughput, around 10x slower. While the lowered throughput is expected,
>>>>> what's perplexing is that machine load is also very low with RocksDB,
>>>>> typically falling to < 25% CPU and negligible IO wait (around 0.1%). Our
>>>>> test instances are EC2 c3.xlarge, which have 4 virtual CPUs and 7.5G RAM,
>>>>> each running a single TaskManager in YARN with 6.5G of memory allocated
>>>>> per TaskManager. The instances also have 2x40G attached SSDs which we
>>>>> have mapped to `taskmanager.tmp.dir`.
>>>>>
>>>>> With FS state and 4 slots per TM, we easily max out with a load average
>>>>> around 5 or 6, so we actually need to throttle the slots down to 3. With
>>>>> RocksDB using the Flink SSD-optimized options we see a load average of
>>>>> around 1. Also, load (and actual throughput) remain more or less constant
>>>>> no matter how many slots we use. The weak load is spread over all CPUs.
>>>>>
>>>>> Here is a sample top:
>>>>>
>>>>> Cpu0  : 20.5%us,  0.0%sy,  0.0%ni, 79.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>>>>> Cpu1  : 18.5%us,  0.0%sy,  0.0%ni, 81.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>>>>> Cpu2  : 11.6%us,  0.7%sy,  0.0%ni, 87.0%id,  0.7%wa,  0.0%hi,  0.0%si,  0.0%st
>>>>> Cpu3  : 12.5%us,  0.3%sy,  0.0%ni, 86.8%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
>>>>>
>>>>> Our pipeline uses tumbling windows, each with a ValueState keyed to a
>>>>> 3-tuple of one string and two ints. Each ValueState comprises a small set
>>>>> of tuples of around 5-7 fields each. The WindowFunction simply diffs
>>>>> against the set and updates state if there is a diff.
>>>>>
>>>>> Any ideas as to what the bottleneck is here? Any suggestions welcome!
>>>>>
>>>>> -Cliff

--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Resource-under-utilization-when-using-RocksDb-state-backend-SOLVED-tp10537p11678.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.