Mailing-List: contact user-help@flink.apache.org; run by ezmlm
Precedence: bulk
MIME-Version: 1.0
References: <CA+faj9wuw5zMgdF3+_qucXPAgMBagwCn88E2wu__2fba+ODnhw@mail.gmail.com>
 <a8f32adf-445f-0a3c-24e8-a37ad691669a@tngtech.com> <CA+faj9y1fZkTwyrOc1nrDFq36sodU9zE+Cyfcgp8U3STcZO3gA@mail.gmail.com>
 <4C7E8CF7-18B3-4A4A-834B-AB9AFBD442DB@data-artisans.com> <CA+faj9wfWP8XueVOFSwydvWygTRFLf-OP1mk-O-0HxMD8HNHag@mail.gmail.com>
 <CANC1h_tRDk7pJn26qwjQjmhzq49FJbotVapuosSQnb7JZwdZaA@mail.gmail.com>
In-Reply-To: <CANC1h_tRDk7pJn26qwjQjmhzq49FJbotVapuosSQnb7JZwdZaA@mail.gmail.com>
From: =?UTF-8?Q?Gyula_F=C3=B3ra?= <gyula.fora@gmail.com>
Date: Fri, 14 Jul 2017 07:56:05 +0000
Message-ID: <CA+faj9wCW4r9u27H42T4GfjPjuHj3Fx0mEpWW-tJR7mHyYvFRg@mail.gmail.com>
Subject: Re: Why would a kafka source checkpoint take so long?
To: Stephan Ewen <sewen@apache.org>
Cc: user <user@flink.apache.org>
Content-Type: multipart/alternative; boundary="94eb2c03cd5e3a2877055442627a"
archived-at: Fri, 14 Jul 2017 07:56:21 -0000

--94eb2c03cd5e3a2877055442627a
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi,

I saw this again yesterday, and this time, with some logging in place, it
looks like acquiring the lock took all of the time. In this case it was
pretty clear that the job had started falling behind a few minutes before
the checkpoint started, so backpressure seems to be the culprit.
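
For anyone who runs into the same symptom, here is a minimal sketch of the
effect in plain Java. It is not Flink code. It assumes, as Stephan
suggested, that the source emits records while holding the checkpoint lock
and that the checkpoint trigger needs the same lock. The names
CheckpointLockDemo and measureLockWaitMillis are made up for illustration,
and the sleep is only a stand-in for an emit call that blocks under
backpressure.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class CheckpointLockDemo {
    // Stand-in for the lock shared by the emit loop and the checkpoint trigger.
    private static final Object CHECKPOINT_LOCK = new Object();

    /**
     * Runs one simulated checkpoint while the "source" is stalled for
     * stallMillis inside its emit call, and returns how many milliseconds
     * the checkpoint thread spent waiting for the lock.
     */
    static long measureLockWaitMillis(long stallMillis) {
        CountDownLatch sourceHoldsLock = new CountDownLatch(1);

        Thread source = new Thread(() -> {
            synchronized (CHECKPOINT_LOCK) {
                sourceHoldsLock.countDown();
                try {
                    // Assumption: models an emit that blocks because
                    // downstream is not consuming fast enough.
                    Thread.sleep(stallMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "source-emit");
        source.start();

        try {
            // Start timing only once the source is known to hold the lock.
            sourceHoldsLock.await();
            long start = System.nanoTime();
            synchronized (CHECKPOINT_LOCK) {
                // The offset snapshot itself would happen here and is cheap.
            }
            long waitedMillis =
                    TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            source.join();
            return waitedMillis;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while measuring", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("lock wait, no stall:     "
                + measureLockWaitMillis(0) + " ms");
        System.out.println("lock wait, 300 ms stall: "
                + measureLockWaitMillis(300) + " ms");
    }
}
```

In this sketch the wait measured by the checkpoint thread tracks the length
of the stall rather than the cost of the snapshot, which matches what the
logging showed: the time goes into acquiring the lock, not into writing the
offsets.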

Thanks,
Gyula

On Wed, Jul 12, 2017 at 15:27, Stephan Ewen <sewen@apache.org> wrote:

> Can it be that the checkpoint thread is waiting to grab the lock, which is
> held by the chain under backpressure?
>
> On Wed, Jul 12, 2017 at 12:23 PM, Gyula Fóra <gyula.fora@gmail.com> wrote:
>
>> Yes, that's definitely what I am about to do next, but I just thought
>> maybe someone had seen this before.
>>
>> Will post info next time it happens. (Not guaranteed to happen soon, as it
>> didn't happen for a long time before.)
>>
>> Gyula
>>
>> On Wed, Jul 12, 2017, 12:13 Stefan Richter <s.richter@data-artisans.com>
>> wrote:
>>
>>> Hi,
>>>
>>> could you add some logging to figure out which method call introduces
>>> the delay?
>>>
>>> Best,
>>> Stefan
>>>
>>> On 12.07.2017 at 11:37, Gyula Fóra <gyula.fora@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> We are using the latest release, 1.3.1.
>>>
>>> Gyula
>>>
>>> On Wed, Jul 12, 2017 at 10:44, Urs Schoenenberger
>>> <urs.schoenenberger@tngtech.com> wrote:
>>>
>>>> Hi Gyula,
>>>>
>>>> I don't know the cause, unfortunately, but we observed a similar issue
>>>> on Flink 1.1.3. The problem seems to be gone after upgrading to 1.2.1.
>>>> Which version are you running?
>>>>
>>>> Urs
>>>>
>>>> On 12.07.2017 09:48, Gyula Fóra wrote:
>>>> > Hi,
>>>> >
>>>> > I have noticed a strange behavior in one of our jobs: every once in a
>>>> > while the Kafka source checkpointing time becomes extremely large
>>>> > compared to what it usually is. (To be specific, it is a Kafka source
>>>> > chained with a stateless map operator.)
>>>> >
>>>> > Checkpointing the offsets usually takes around 10 ms, which sounds
>>>> > reasonable, but in some checkpoints this goes into the 3-5 minute
>>>> > range, practically blocking the job for that period of time. Yesterday
>>>> > I observed delays of even 10 minutes. At first I thought that some
>>>> > sources might trigger checkpoints later than others, but after adding
>>>> > some logging and comparing, it seems that triggerCheckpoint was
>>>> > received at the same time.
>>>> >
>>>> > Interestingly, only one of the 3 Kafka sources in the job seems to be
>>>> > affected (last time I checked, at least). We are still using the 0.8
>>>> > consumer with commit on checkpoints. Also, I don't see this happen in
>>>> > other jobs.
>>>> >
>>>> > Any clue on what might cause this?
>>>> >
>>>> > Thanks :)
>>>> > Gyula
>>>>
>>>> --
>>>> Urs Schönenberger - urs.schoenenberger@tngtech.com
>>>>
>>>> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
>>>> Managing directors: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
>>>> Registered office: Unterföhring * Amtsgericht München * HRB 135082
>>>>
>>>
>>>
>

--94eb2c03cd5e3a2877055442627a--