From: Stephan Ewen <sewen@apache.org>
Date: Tue, 14 Mar 2017 17:39:16 +0100
Subject: Re: Checkpointing with RocksDB as statebackend
To: user@flink.apache.org

The issue in Flink is https://issues.apache.org/jira/browse/FLINK-5756

On Tue, Mar 14, 2017 at 3:40 PM, Stefan Richter <s.richter@data-artisans.com> wrote:

> Hi Vinay,
>
> I think the issue is tracked here: https://github.com/facebook/rocksdb/issues/1988.
>
> Best,
> Stefan
>
> On 14.03.2017 at 15:31, Vishnu Viswanath <vishnu.viswanath25@gmail.com> wrote:
>
> Hi Stephan,
>
> Is there a ticket number/link to track this? My job has all the conditions you mentioned.
>
> Thanks,
> Vishnu
>
> On Tue, Mar 14, 2017 at 7:13 AM, Stephan Ewen wrote:
>
>> Hi Vinay!
>>
>> We just discovered a bug in RocksDB. The bug affects windows without reduce() or fold(), windows with evictors, and ListState.
>>
>> A certain access pattern in RocksDB becomes so slow beyond a certain size per key that it basically brings down the streaming program and the snapshots.
>>
>> We are reaching out to the RocksDB folks and looking for workarounds in Flink.
>>
>> Greetings,
>> Stephan
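For reference, a minimal sketch of the pattern that avoids the problematic ListState growth: a window with an incremental reduce() keeps a single value per key instead of buffering every element. The stream contents and window size below are placeholder assumptions, not taken from the job discussed here.

// Minimal sketch: a window with an incremental reduce() keeps per-key state
// to one value, instead of buffering the whole window contents in ListState
// (the access pattern hit by the RocksDB slowdown described above).
// The input elements and window size are made-up placeholders.
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class IncrementalWindowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(Tuple2.of("a", 1L), Tuple2.of("b", 2L), Tuple2.of("a", 3L))
            .keyBy(0)
            .timeWindow(Time.minutes(10))
            // elements are folded into a single running value as they arrive
            .reduce(new ReduceFunction<Tuple2<String, Long>>() {
                @Override
                public Tuple2<String, Long> reduce(Tuple2<String, Long> a, Tuple2<String, Long> b) {
                    return Tuple2.of(a.f0, a.f1 + b.f1);
                }
            })
            .print();

        env.execute("incremental window sketch");
    }
}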
>>
>> On Wed, Mar 1, 2017 at 12:10 PM, Stephan Ewen <sewen@apache.org> wrote:
>>
>>> @vinay Can you try to not set the buffer timeout at all? I am actually not sure what the effect of setting it to a negative value would be; that can be a cause of problems...
>>>
>>> On Mon, Feb 27, 2017 at 7:44 PM, Seth Wiesman <swiesman@mediamath.com> wrote:
>>>
>>>> Vinay,
>>>>
>>>> The bucketing sink performs rename operations during the checkpoint, and if it tries to rename a file that is not yet consistent, that causes a FileNotFound exception, which fails the checkpoint.
>>>>
>>>> Stephan,
>>>>
>>>> Currently my AWS fork contains some very specific assumptions about the pipeline that will in general only hold for my pipeline. This is because there were still some open questions I had about how to solve consistency issues in the general case. I will comment on the Jira issue with more specifics.
>>>>
>>>> Seth Wiesman
>>>>
>>>> From: vinay patil <vinay18.patil@gmail.com>
>>>> Reply-To: "user@flink.apache.org" <user@flink.apache.org>
>>>> Date: Monday, February 27, 2017 at 1:05 PM
>>>> To: "user@flink.apache.org" <user@flink.apache.org>
>>>> Subject: Re: Checkpointing with RocksDB as statebackend
>>>>
>>>> Hi Seth,
>>>>
>>>> Thank you for your suggestion.
>>>>
>>>> But if the issue is only related to S3, then why does this also happen when I replace the S3 sink with HDFS (for checkpointing I am using HDFS only)?
>>>>
>>>> Stephan,
>>>>
>>>> Another issue I see is when I set env.setBufferTimeout(-1) and keep the checkpoint interval at 10 minutes: I have observed that nothing gets written to the sink (tried with S3 as well as HDFS); at least I was expecting pending files here.
>>>>
>>>> This issue gets worse when checkpointing is disabled, as nothing is written.
>>>>
>>>> Regards,
>>>> Vinay Patil
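A minimal sketch of the configuration being discussed, assuming checkpoints go to HDFS and the buffer timeout is simply left at its default rather than set to -1; the HDFS URI and the intervals are placeholders.

// Sketch only: RocksDB state backend with checkpoints on HDFS, a 10-minute
// checkpoint interval, and the network buffer timeout left at its default
// (per the suggestion above, rather than env.setBufferTimeout(-1)).
// The HDFS URI is a placeholder.
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // keyed state lives in RocksDB; snapshots go to HDFS, not S3
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));

        // checkpoint every 10 minutes
        env.enableCheckpointing(10 * 60 * 1000L);

        // note: no env.setBufferTimeout(-1) here; the default is kept

        // a trivial placeholder pipeline so the sketch actually runs
        env.fromElements(1, 2, 3).print();

        env.execute("checkpoint config sketch");
    }
}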
>>>>
>>>> On Mon, Feb 27, 2017 at 10:55 PM, Stephan Ewen [via Apache Flink User Mailing List archive.] <[hidden email]> wrote:
>>>>
>>>> Hi Seth!
>>>>
>>>> Wow, that is an awesome approach.
>>>>
>>>> We have actually seen these issues as well, and we are looking to eventually implement our own S3 file system (and circumvent Hadoop's S3 connector that Flink currently relies on): https://issues.apache.org/jira/browse/FLINK-5706
>>>>
>>>> Do you think your patch would be a good starting point for that, and would you be willing to share it?
>>>>
>>>> The Amazon AWS SDK for Java is Apache 2 licensed, so it is possible to fork officially, if necessary...
>>>>
>>>> Greetings,
>>>> Stephan
>>>>
>>>> On Mon, Feb 27, 2017 at 5:15 PM, Seth Wiesman <[hidden email]> wrote:
>>>>
>>>> Just wanted to throw in my 2 cents.
>>>>
>>>> I've been running pipelines with a similar state size, using RocksDB, externalizing checkpoints to S3 and bucketing to S3. I was getting stalls like this and ended up tracing the problem to S3 and the bucketing sink. The solution was two-fold:
>>>>
>>>> 1) I forked hadoop-aws and have it treat Flink as a source of truth. EMR uses a DynamoDB table to determine if S3 is inconsistent. Instead, I say that if Flink believes a file exists on S3 and we don't see it, then I am going to trust that Flink is in a consistent state and S3 is not. In this case, various operations will perform a back-off and retry up to a certain number of times.
>>>>
>>>> 2) The bucketing sink performs multiple renames over the lifetime of a file, occurring when a checkpoint starts and then again on notification after it completes. Due to S3's consistency guarantees, the second rename of a file can never be assured to work and will eventually fail, either during or after a checkpoint. Because there is no upper bound on the time it takes for a file on S3 to become consistent, retries cannot solve this specific problem, as a rename could take upwards of many minutes and would stall the entire pipeline. The only viable solution I could find was to write a custom sink which understands S3. Each writer writes its file locally and then copies it to S3 on checkpoint. By only interacting with S3 once per file, it circumvents the consistency issues altogether.
>>>>
>>>> Hope this helps,
>>>>
>>>> Seth Wiesman
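A heavily simplified sketch of that write-locally-then-upload-on-checkpoint idea; it is not the actual fork or sink described above, and it omits recovery, exactly-once handling, and proper part-file naming. The bucket name and key prefix are placeholders.

// Heavily simplified sketch of a "write locally, upload to S3 on checkpoint"
// sink, in the spirit of the approach described above. Not production code:
// it ignores recovery, exactly-once semantics, and part-file naming, and the
// bucket/prefix values are placeholders.
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import java.io.BufferedWriter;
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class LocalThenS3Sink extends RichSinkFunction<String> implements CheckpointedFunction {

    private transient AmazonS3 s3;
    private transient File currentFile;
    private transient BufferedWriter writer;
    private long partCounter;

    @Override
    public void open(Configuration parameters) throws Exception {
        s3 = AmazonS3ClientBuilder.defaultClient();
        startNewFile();
    }

    @Override
    public void invoke(String value) throws Exception {
        // records are buffered in a local temp file between checkpoints
        writer.write(value);
        writer.newLine();
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // on checkpoint, close the local file and upload it to S3 once,
        // then start a fresh local file for the next interval
        writer.close();
        String key = "output/part-" + getRuntimeContext().getIndexOfThisSubtask() + "-" + (partCounter++);
        s3.putObject("my-placeholder-bucket", key, currentFile);
        currentFile.delete();
        startNewFile();
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        // restore logic (re-uploading or discarding in-progress files) is omitted
    }

    private void startNewFile() throws Exception {
        currentFile = Files.createTempFile("flink-sink-", ".txt").toFile();
        writer = Files.newBufferedWriter(currentFile.toPath(), StandardCharsets.UTF_8);
    }
}

The point of this shape is that S3 sees each file exactly once (a single putObject per part), so rename and list-after-write consistency never come into play.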
>>>>
>>>> From: vinay patil <[hidden email]>
>>>> Reply-To: "[hidden email]" <[hidden email]>
>>>> Date: Saturday, February 25, 2017 at 10:50 AM
>>>> To: "[hidden email]" <[hidden email]>
>>>> Subject: Re: Checkpointing with RocksDB as statebackend
>>>>
>>>> Hi Stephan,
>>>>
>>>> Just to avoid confusion here: I am using an S3 sink for writing the data, and using HDFS for storing checkpoints.
>>>>
>>>> There are two core nodes (HDFS) and two task nodes on EMR.
>>>>
>>>> I replaced the S3 sink with HDFS for writing data in my last test.
>>>>
>>>> Let's say the checkpoint interval is 5 minutes. Within 5 minutes of running, the state size grows to 30 GB. After checkpointing, the 30 GB state that is maintained in RocksDB has to be copied to HDFS, right? Is this causing the pipeline to stall?
>>>>
>>>> Regards,
>>>> Vinay Patil
>>>>
>>>> On Sat, Feb 25, 2017 at 12:22 AM, Vinay Patil <[hidden email]> wrote:
>>>>
>>>> Hi Stephan,
>>>>
>>>> To verify whether S3 is making the pipeline stall, I have replaced the S3 sink with HDFS and set the minimum pause between checkpoints to 5 minutes; I still see the same issue with checkpoints failing.
>>>>
>>>> If I set the pause time to 20 seconds, all checkpoints complete; however, there is a hit to overall throughput.
>>>>
>>>> Regards,
>>>> Vinay Patil
>>>>
>>>> On Fri, Feb 24, 2017 at 10:09 PM, Stephan Ewen [via Apache Flink User Mailing List archive.] <[hidden email]> wrote:
>>>>
>>>> Flink's state backends currently do a good number of "make sure this exists" operations on the file systems. Through Hadoop's S3 filesystem, that translates to S3 bucket list operations, where there is a limit on how many operations may happen per time interval. After that, S3 blocks.
>>>>
>>>> It seems that operations that are totally cheap on HDFS are hellishly expensive (and limited) on S3. It may be that you are affected by that.
>>>>
>>>> We are gradually trying to improve the behavior there and be more S3 aware.
>>>>
>>>> Both 1.3-SNAPSHOT and 1.2-SNAPSHOT already contain improvements there.
>>>>
>>>> Best,
>>>> Stephan
>>>>
>>>> On Fri, Feb 24, 2017 at 4:42 PM, vinay patil <[hidden email]> wrote:
>>>>
>>>> Hi Stephan,
>>>>
>>>> So do you mean that S3 is causing the stall? As I mentioned in my previous mail, I could not see any progress for 16 minutes because checkpoints kept failing continuously.
>>>>
>>>> On Feb 24, 2017 8:30 PM, "Stephan Ewen [via Apache Flink User Mailing List archive.]" <[hidden email]> wrote:
>>>>
>>>> Hi Vinay!
>>>>
>>>> True, the operator state (like Kafka) is currently not asynchronously checkpointed.
>>>>
>>>> While it is rather small state, we have seen before that on S3 it can cause trouble, because S3 frequently stalls uploads of data amounts even as low as kilobytes due to its throttling policies.
>>>>
>>>> That would be a super important fix to add!
>>>>
>>>> Best,
>>>> Stephan
>>>>
>>>> On Fri, Feb 24, 2017 at 2:58 PM, vinay patil <[hidden email]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I have attached a snapshot for reference. As you can see, all 3 checkpoints failed; for checkpoint IDs 2 and 3 it is stuck at the Kafka source after 50%. (The data sent so far by Kafka source 1 is 65 GB and by source 2 is 15 GB.)
>>>>
>>>> Within 10 minutes, 15M records were processed, and for the next 16 minutes the pipeline was stuck; I don't see any progress beyond 15M because checkpoints keep failing consistently.
>>>>
>>>> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/n11882/Checkpointing_Failed.png>
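For reference, a minimal sketch of the checkpoint-pacing knobs that come up in this thread (interval, minimum pause between checkpoints, checkpoint timeout, concurrency); the values are illustrative placeholders, not tuning recommendations.

// Sketch of the checkpoint-pacing settings discussed above (interval,
// minimum pause between checkpoints, checkpoint timeout, concurrency).
// Values are illustrative placeholders, not tuning advice.
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointPacingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.enableCheckpointing(5 * 60 * 1000L);          // checkpoint every 5 minutes

        CheckpointConfig cc = env.getCheckpointConfig();
        cc.setMinPauseBetweenCheckpoints(20 * 1000L);     // e.g. the 20-second pause mentioned above
        cc.setCheckpointTimeout(10 * 60 * 1000L);         // fail (rather than hang) a checkpoint after 10 minutes
        cc.setMaxConcurrentCheckpoints(1);                // never overlap checkpoints

        // a trivial placeholder pipeline so the sketch actually runs
        env.fromElements("a", "b", "c").print();

        env.execute("checkpoint pacing sketch");
    }
}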