Date: Tue, 21 Feb 2017 05:57:31 -0800 (PST)
From: vinay patil
To: user@flink.apache.org
Subject: Re: Flink checkpointing gets stuck

Hi Shai,

I was facing a similar issue; however, the stream no longer gets stuck in between.

You can refer to this thread for the configuration I have done:
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-td11752.html

What is the configuration you are running the job with?
Which RocksDB predefined option are you using?
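For reference, this is roughly the kind of configuration I mean: a minimal sketch against the Flink 1.2 DataStream API (with flink-statebackend-rocksdb on the classpath). The 10 second interval, 30 minute timeout, and wasbs checkpoint path follow your description below; the storage account, container, pause/concurrency settings, and the particular predefined option are placeholders, not values from your setup:

import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 10 seconds, exactly-once, as in the quoted setup.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        // 30 minute timeout (the value the quoted checkpoints are hitting);
        // the pause and concurrency limits below are illustrative only.
        env.getCheckpointConfig().setCheckpointTimeout(30 * 60 * 1000);
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5_000);
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);

        // RocksDB state backend with checkpoints going to Azure Blob Store over wasbs.
        // <container> and <account> are placeholders.
        RocksDBStateBackend backend = new RocksDBStateBackend(
                "wasbs://<container>@<account>.blob.core.windows.net/flink/checkpoints");

        // Predefined RocksDB tuning profile; pick the one matching the TaskManager disks.
        backend.setPredefinedOptions(PredefinedOptions.FLASH_SSD_OPTIMIZED);

        env.setStateBackend(backend);

        // ... sources, operators, and sinks go here ...
        env.execute("checkpoint-config-sketch");
    }
}

On regular (non-SSD) disks, SPINNING_DISK_OPTIMIZED or SPINNING_DISK_OPTIMIZED_HIGH_MEM would be the more appropriate predefined option.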
Regards,
Vinay Patil

On Tue, Feb 21, 2017 at 7:13 PM, Shai Kaplan [via Apache Flink User Mailing
List archive.] wrote:

> Hi.
>
> I'm running a Flink 1.2 job with a 10 second checkpoint interval. After
> some running time (minutes to hours), Flink fails to save checkpoints and
> stops processing records (I'm not sure whether the checkpointing failure
> is the cause of the problem or just a symptom).
>
> After several checkpoints that each take a few seconds, they start
> failing due to the 30 minute timeout.
>
> When I restart one of the Task Manager services (just to get the job
> restarted), the job is recovered from the last successful checkpoint (the
> state size continues to grow, so it's probably not the reason for the
> failure), advances somewhat, saves some more checkpoints, and then enters
> the failing state again.
>
> One of the times it happened, the first failing checkpoint failed with
> "Checkpoint Coordinator is suspending.", so that might be an indicator of
> the cause of the problem, but looking into Flink's code I can't see how a
> running job could get into this state.
>
> I am using RocksDB for state, and the state is saved to Azure Blob Store,
> using the NativeAzureFileSystem HDFS connector over the wasbs protocol.
>
> Any ideas? Possibly a bug in Flink or RocksDB?
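As a follow-up on the RocksDB side of the quoted setup: if the predefined options are not enough, RocksDB can also be tuned directly through a custom OptionsFactory on the state backend. The sketch below is purely illustrative; none of these values come from this thread, and they would need to be sized against the actual TaskManager memory and disks:

import org.apache.flink.contrib.streaming.state.OptionsFactory;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;

// Hypothetical tuning profile; none of these values come from the thread.
public class TunedRocksDbOptions implements OptionsFactory {

    @Override
    public DBOptions createDBOptions(DBOptions currentOptions) {
        return currentOptions
                .setIncreaseParallelism(4)        // more shared background threads
                .setMaxBackgroundCompactions(4)   // let compactions keep up with writes
                .setMaxOpenFiles(-1);             // keep SST file handles open
    }

    @Override
    public ColumnFamilyOptions createColumnOptions(ColumnFamilyOptions currentOptions) {
        return currentOptions
                .setWriteBufferSize(64 * 1024 * 1024)  // 64 MB memtables
                .setMaxWriteBufferNumber(4)
                .setMinWriteBufferNumberToMerge(2);
    }
}

It would be registered with backend.setOptions(new TunedRocksDbOptions()); and is applied on top of whichever predefined option is selected.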
--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-checkpointing-gets-stuck-tp11776p11778.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.