Date: Tue, 21 Feb 2017 05:57:31 -0800 (PST)
From: vinay patil
To: user@flink.apache.org
Subject: Re: Flink checkpointing gets stuck

Hi Shai,

I was facing a similar issue; however, the stream no longer gets stuck in between.

You can refer to this thread for the configuration I have done:
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-td11752.html

What is the configuration you are running the job with?
Which RocksDB predefined option are you using?
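For reference, this is roughly the kind of configuration I mean: a minimal sketch against the Flink 1.2 DataStream API (with flink-statebackend-rocksdb on the classpath). The 10 second interval, 30 minute timeout, and wasbs checkpoint path follow your description below; the storage account, container, pause/concurrency settings, and the particular predefined option are placeholders, not values from your setup:

import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 10 seconds, exactly-once, as in the quoted setup.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        // 30 minute timeout (the value the quoted checkpoints are hitting);
        // the pause and concurrency limits below are illustrative only.
        env.getCheckpointConfig().setCheckpointTimeout(30 * 60 * 1000);
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5_000);
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);

        // RocksDB state backend with checkpoints going to Azure Blob Store over wasbs.
        // <container> and <account> are placeholders.
        RocksDBStateBackend backend = new RocksDBStateBackend(
                "wasbs://<container>@<account>.blob.core.windows.net/flink/checkpoints");

        // Predefined RocksDB tuning profile; pick the one matching the TaskManager disks.
        backend.setPredefinedOptions(PredefinedOptions.FLASH_SSD_OPTIMIZED);

        env.setStateBackend(backend);

        // ... sources, operators, and sinks go here ...
        env.execute("checkpoint-config-sketch");
    }
}

On regular (non-SSD) disks, SPINNING_DISK_OPTIMIZED or SPINNING_DISK_OPTIMIZED_HIGH_MEM would be the more appropriate predefined option.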
Regards,
Vinay Patil

On Tue, Feb 21, 2017 at 7:13 PM, Shai Kaplan [via Apache Flink User Mailing
List archive.] wrote:

> Hi.
>
> I'm running a Flink 1.2 job with a 10 second checkpoint interval. After
> some running time (minutes to hours), Flink fails to save checkpoints and
> stops processing records (I'm not sure whether the checkpointing failure
> is the cause of the problem or just a symptom).
>
> After several checkpoints that each take a few seconds, they start
> failing due to the 30 minute timeout.
>
> When I restart one of the Task Manager services (just to get the job
> restarted), the job is recovered from the last successful checkpoint (the
> state size continues to grow, so it's probably not the reason for the
> failure), advances somewhat, saves some more checkpoints, and then enters
> the failing state again.
>
> One of the times it happened, the first failing checkpoint failed with
> "Checkpoint Coordinator is suspending.", so that might be an indicator of
> the cause of the problem, but looking into Flink's code I can't see how a
> running job could get into this state.
>
> I am using RocksDB for state, and the state is saved to Azure Blob Store,
> using the NativeAzureFileSystem HDFS connector over the wasbs protocol.
>
> Any ideas? Possibly a bug in Flink or RocksDB?
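As a follow-up on the RocksDB side of the quoted setup: if the predefined options are not enough, RocksDB can also be tuned directly through a custom OptionsFactory on the state backend. The sketch below is purely illustrative; none of these values come from this thread, and they would need to be sized against the actual TaskManager memory and disks:

import org.apache.flink.contrib.streaming.state.OptionsFactory;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;

// Hypothetical tuning profile; none of these values come from the thread.
public class TunedRocksDbOptions implements OptionsFactory {

    @Override
    public DBOptions createDBOptions(DBOptions currentOptions) {
        return currentOptions
                .setIncreaseParallelism(4)        // more shared background threads
                .setMaxBackgroundCompactions(4)   // let compactions keep up with writes
                .setMaxOpenFiles(-1);             // keep SST file handles open
    }

    @Override
    public ColumnFamilyOptions createColumnOptions(ColumnFamilyOptions currentOptions) {
        return currentOptions
                .setWriteBufferSize(64 * 1024 * 1024)  // 64 MB memtables
                .setMaxWriteBufferNumber(4)
                .setMinWriteBufferNumberToMerge(2);
    }
}

It would be registered with backend.setOptions(new TunedRocksDbOptions()); and is applied on top of whichever predefined option is selected.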
--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-checkpointing-gets-stuck-tp11776p11778.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.