Date: Thu, 16 Feb 2017 05:24:43 -0800 (PST)
From: vinay patil
To: user@flink.apache.org
Subject: Re: Resource under-utilization when using RocksDb state backend [SOLVED]

Hi Cliff,

It would be really helpful if you could share your RocksDB configuration.

I am also running on c3.4xlarge EC2 instances backed by SSDs.

I had tried the FLASH_SSD_OPTIMIZED option, which works great, but somehow the
pipeline stalls partway through and the overall processing time increases. I
tried to set different values as mentioned in the video linked below, but I am
not getting it right; the TaskManagers get killed after some time.
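For reference, this is roughly how I am wiring the backend at the moment. It is
only a minimal sketch assuming the RocksDBStateBackend / OptionsFactory API from
the flink-statebackend-rocksdb module; the checkpoint URI, class name, and all
of the sizes below are placeholders rather than my real configuration:

import org.apache.flink.contrib.streaming.state.OptionsFactory;
import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;

public class RocksDbBackendSetup {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder checkpoint location.
        RocksDBStateBackend backend =
                new RocksDBStateBackend("hdfs:///flink/checkpoints");

        // Start from the SSD profile ...
        backend.setPredefinedOptions(PredefinedOptions.FLASH_SSD_OPTIMIZED);

        // ... then override a few knobs on top of it (example numbers only).
        backend.setOptions(new OptionsFactory() {

            @Override
            public DBOptions createDBOptions(DBOptions currentOptions) {
                // More background threads for flushes and compactions.
                return currentOptions.setIncreaseParallelism(4);
            }

            @Override
            public ColumnFamilyOptions createColumnOptions(ColumnFamilyOptions currentOptions) {
                // Larger write buffers and block cache = more memory for caching.
                return currentOptions
                        .setWriteBufferSize(64 * 1024 * 1024)
                        .setMaxWriteBufferNumber(4)
                        .setTableFormatConfig(
                                new BlockBasedTableConfig()
                                        .setBlockCacheSize(256 * 1024 * 1024));
            }
        });

        env.setStateBackend(backend);
        // ... rest of the pipeline ...
    }
}

My suspicion is that whatever extra memory RocksDB takes for its caches comes on
top of the TaskManager heap, so values that are too generous could push the
container over its YARN limit and get the TaskManager killed.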
Regards,
Vinay Patil

On Thu, Dec 8, 2016 at 10:19 PM, Cliff Resnick [via Apache Flink User Mailing
List archive.] wrote:

> It turns out that most of the time in RocksDBFoldingState was spent on
> serialization/deserialization. RocksDB read/write was performing well. By
> moving from Kryo to custom serialization we were able to increase
> throughput dramatically. Load is now where it should be.
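That is the direction I want to explore as well. Before writing a dedicated
serializer I first want to find every place where Kryo still kicks in. A minimal
sketch of that, assuming ExecutionConfig#disableGenericTypes is available in the
version we run (this is not Cliff's actual change):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FindKryoFallbacks {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // With generic types disabled, Flink throws an exception wherever a data
        // type would fall back to the generic (Kryo) serializer, so each fallback
        // can be located and replaced with a dedicated serializer one at a time.
        env.getConfig().disableGenericTypes();

        // ... define sources/operators here, then call env.execute(...) ...
    }
}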
>
> On Mon, Dec 5, 2016 at 1:15 PM, Robert Metzger <[hidden email]> wrote:
>
>> Another Flink user running RocksDB with large state on SSDs recently posted
>> this video on optimizing the performance of RocksDB on SSDs:
>> https://www.youtube.com/watch?v=pvUqbIeoPzM
>> That could be relevant for you.
>>
>> For how long did you look at iotop? It could be that the IO access happens
>> in bursts, depending on how data is cached.
>>
>> I'll also add Stefan Richter to the conversation; he may have some more
>> ideas about what we can do here.
>>
>> On Mon, Dec 5, 2016 at 6:19 PM, Cliff Resnick <[hidden email]> wrote:
>>
>>> Hi Robert,
>>>
>>> We're following 1.2-SNAPSHOT, using event time. I have tried "iotop"
>>> and I usually see less than 1% IO. The most I've seen was a quick flash
>>> here or there of something substantial (e.g. 19%, 52%), then back to
>>> nothing. I also assumed we were disk-bound, but to use your metaphor I'm
>>> having trouble finding any smoke. However, I'm not very experienced at
>>> sussing out IO issues, so perhaps there is something else I'm missing.
>>>
>>> I'll keep investigating. If I continue to come up empty, then I guess my
>>> next step may be to stage some independent tests directly against RocksDB.
>>>
>>> -Cliff
>>>
>>> On Mon, Dec 5, 2016 at 5:52 AM, Robert Metzger <[hidden email]> wrote:
>>>
>>>> Hi Cliff,
>>>>
>>>> Which Flink version are you using?
>>>> Are you using event-time or processing-time windows?
>>>>
>>>> I suspect that your disks are "burning" (= your job is IO bound). Can
>>>> you check with a tool like "iotop" how much disk IO Flink is producing?
>>>> Then I would set this number in relation to the theoretical maximum of
>>>> your SSDs (a good rough estimate is to use dd for that).
>>>>
>>>> If you find that your disk bandwidth is saturated by Flink, you could
>>>> look into tuning the RocksDB settings so that it uses more memory for
>>>> caching.
>>>>
>>>> Regards,
>>>> Robert
>>>>
>>>> On Fri, Dec 2, 2016 at 11:34 PM, Cliff Resnick <[hidden email]> wrote:
>>>>
>>>>> In tests comparing RocksDB to the fs state backend we observe much lower
>>>>> throughput, around 10x slower. While the lowered throughput is expected,
>>>>> what's perplexing is that machine load is also very low with RocksDB,
>>>>> typically falling to < 25% CPU and negligible IO wait (around 0.1%). Our
>>>>> test instances are EC2 c3.xlarge, which have 4 virtual CPUs and 7.5G RAM,
>>>>> each running a single TaskManager in YARN with 6.5G of memory allocated
>>>>> per TaskManager. The instances also have 2x40G attached SSDs which we
>>>>> have mapped to `taskmanager.tmp.dir`.
>>>>>
>>>>> With FS state and 4 slots per TM, we easily max out with a load average
>>>>> around 5 or 6, so we actually need to throttle the slots down to 3. With
>>>>> RocksDB using the Flink SSD-optimized options we see a load average of
>>>>> around 1. Also, load (and actual throughput) remain more or less constant
>>>>> no matter how many slots we use. The weak load is spread over all CPUs.
>>>>>
>>>>> Here is a sample top:
>>>>>
>>>>> Cpu0  : 20.5%us,  0.0%sy,  0.0%ni, 79.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>>>>> Cpu1  : 18.5%us,  0.0%sy,  0.0%ni, 81.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>>>>> Cpu2  : 11.6%us,  0.7%sy,  0.0%ni, 87.0%id,  0.7%wa,  0.0%hi,  0.0%si,  0.0%st
>>>>> Cpu3  : 12.5%us,  0.3%sy,  0.0%ni, 86.8%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
>>>>>
>>>>> Our pipeline uses tumbling windows, each with a ValueState keyed to a
>>>>> 3-tuple of one string and two ints. Each ValueState comprises a small set
>>>>> of tuples of around 5-7 fields each. The WindowFunction simply diffs
>>>>> against the set and updates state if there is a diff.
>>>>>
>>>>> Any ideas as to what the bottleneck is here? Any suggestions welcome!
>>>>>
>>>>> -Cliff

--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Resource-under-utilization-when-using-RocksDb-state-backend-SOLVED-tp10537p11678.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.