From: Stefan Richter <s.richter@data-artisans.com>
To: Tony Wei <tony19920430@gmail.com>
Cc: user <user@flink.apache.org>, Stephan Ewen
Subject: Re: Stream Task seems to be blocked after checkpoint timeout
Date: Thu, 28 Sep 2017 13:43:44 +0200

Hi,

the gap between the sync and the async part does not mean too much. What happens per task is that all operators go through their sync part, and then one thread executes all the async parts, one after the other. So if an async part starts late, this is just because it started only after another async part finished.

I have one more question about your job, because it involves communication with external systems like S3 and a database. Are you sure that they cannot sometimes become a bottleneck, block, and bring down your job? In particular: is the same S3 used to serve the operator and checkpointing, and what are your sustained read/write rate there and the maximum number of connections? You can try to use the backpressure metric and identify the first operator (counting from the sink) that indicates backpressure.
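As a concrete starting point, here is a minimal, hedged sketch of how that check could be scripted against the JobManager's monitoring REST API (Java, standard library only). The endpoint path is what the web UI's back pressure view queries as far as I remember, and the host/port and vertex ids are placeholders, so treat this as an illustration rather than an official tool:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Hedged sketch: walk the job's vertices from sink to source and print the raw
// back pressure JSON for each. The first vertex (counting from the sink) that
// reports a non-OK level is the most likely culprit. The very first request per
// vertex may only trigger sampling, so repeat the call after a few seconds.
public class BackPressureProbe {

    public static void main(String[] args) throws Exception {
        String jobManager = "http://localhost:8081";           // assumption: default web monitor address
        String jobId = "7c039572b13346f1b17dcc0ace2b72c2";     // job id taken from the checkpoint path in the logs
        String[] vertexIdsSinkToSource = args;                 // pass vertex ids as arguments, sink first

        for (String vertexId : vertexIdsSinkToSource) {
            URL url = new URL(jobManager + "/jobs/" + jobId + "/vertices/" + vertexId + "/backpressure");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                StringBuilder body = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line);
                }
                System.out.println(vertexId + " -> " + body);
            } finally {
                conn.disconnect();
            }
        }
    }
}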

Best,
Stefan

On 28.09.2017 at 12:59, Tony Wei <tony19920430@gmail.com> wrote:

Hi,

Sorry. This is the correct one.

Best Regards,
Tony Wei

2017-09-28 18:55 GMT+08:00 Tony Wei <tony19920430@gmail.com>:
Hi Stefan, 

Sorry for providing partial information. The attachment is the full logs for checkpoint #1577.

The reason I said the asynchronous part seems not to have been executed immediately is that all the synchronous parts had already finished at 2017-09-27 13:49.
Did that mean the checkpoint barrier event had already arrived at the operator and started as soon as the JM triggered the checkpoint?

Best Regards,
Tony Wei

2017-09-28 18:22 GMT+08:00 Stefan Richter <s.richter@data-artisans.com>:
Hi,

I agree that the memory consumption looks good. If there is only one TM, it will run inside one JVM. As for the 7 minutes, you mean the reported end-to-end time? This time measurement starts when the checkpoint is triggered on the job manager; the first contributor is then the time that it takes for the checkpoint barrier event to travel with the stream to the operators. If there is back pressure and a lot of events are buffered, this can introduce delay to this first part, because barriers must not overtake data for correctness. After the barrier arrives at the operator, next comes the synchronous part of the checkpoint, which is typically short running and takes a snapshot of the state (think of creating an immutable version, e.g. through copy on write). In the asynchronous part, this snapshot is persisted to DFS. After that the timing stops and is reported together with the acknowledgement to the job manager.

So, I would assume that if reporting took 7 minutes end-to-end, and the async part took 4 minutes, it is likely that it took around 3 minutes for the barrier event to travel with the stream. About the debugging, I think it is hard to figure out what is going on with the DFS if you don't have metrics on that. Maybe you could attach a sampler to the TM's JVM and monitor where time is spent for the snapshotting?

I am also looping in Stephan; he might have more suggestions.

Best,
Stefan

On 28.09.2017 at 11:25, Tony Wei <tony19920430@gmail.com> wrote:

Hi Stefan,

Here is some telemetry information, but I don't have historical information about GC.

<screenshot: 2017-09-28 4.51.26 PM.png>
<screenshot: 2017-09-28 4.51.11 PM.png>

1) Yes, my state is not large.
2) My DFS is S3, but my cluster is outside AWS. That might be a problem. Since this is a POC, we might move to AWS in the future or use HDFS in the same cluster. However, how can I recognize whether this is actually the problem? (See the sketch after this list.)
3) It seems memory usage is bounded. I'm not sure whether the status shown above is fine.
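For context, a minimal sketch of where the S3 checkpoint location from the TM log plugs in, assuming the FsStateBackend from the 1.3 DataStream API. If the async-I/O operator uploads to the same bucket or endpoint, checkpoint writes and operator writes would share the same S3 connection budget, which is one thing to rule out:

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hedged sketch: point checkpointing at the S3 path that appears in the TM log.
public class CheckpointBackendSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.setStateBackend(new FsStateBackend("s3://tony-dev/flink-checkpoints"));
        // (some versions also offer a constructor overload with a boolean flag to
        //  force asynchronous snapshots; check the overloads of your Flink version)

        // ... build and execute the job ...
    }
}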

There is only one TM in my cluster for now, so all tasks are running on that machine. I think that means they are in the same JVM, right?
Besides the asynchronous part taking so long, another question is that the late message showed this task was delayed for almost 7 minutes, but the log showed it only took 4 minutes.
It seems that it was somehow waiting to be executed. Are there any pointers for finding out what happened?

Regarding the log information, what I mean is that it is hard to recognize which checkpoint id the asynchronous parts belong to when a checkpoint takes more time and more concurrent checkpoints are in flight.
Also, it seems that an asynchronous part might not be executed right away if there is no free resource in the thread pool. It would be better to measure the time between creation and the start of processing, and log it together with the checkpoint id alongside the existing log line that reports how long the asynchronous part took.
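As a purely illustrative sketch of the kind of logging meant here (the class and the scheduling are simplified; none of this is Flink's actual code), a wrapper like the following would record both the queue wait and the run time together with the checkpoint id:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical wrapper: records when an async snapshot task was created and how
// long it waited in the thread pool queue before it started running.
public class QueueWaitLoggingRunnable implements Runnable {

    private final long checkpointId;
    private final Runnable asyncSnapshotPart;
    private final long createdAtMillis = System.currentTimeMillis();

    public QueueWaitLoggingRunnable(long checkpointId, Runnable asyncSnapshotPart) {
        this.checkpointId = checkpointId;
        this.asyncSnapshotPart = asyncSnapshotPart;
    }

    @Override
    public void run() {
        long waited = System.currentTimeMillis() - createdAtMillis;
        long started = System.currentTimeMillis();
        asyncSnapshotPart.run();
        long took = System.currentTimeMillis() - started;
        // One line that carries both the checkpoint id and the queue wait time,
        // so a late async part can be attributed to the right checkpoint.
        System.out.printf("checkpoint %d: async part waited %d ms, ran %d ms%n",
                checkpointId, waited, took);
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(new QueueWaitLoggingRunnable(1577, () -> { /* snapshot work */ }));
        pool.shutdown();
    }
}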

Best Regards,
Tony Wei

2017-09-28 16:25 GMT+08:00 Stefan Richter <s.richter@data-artisans.com>:
Hi,

when the async part takes that long I would have 3 things to look at:

1) Is your state so large? I don't think this applies in your case, right?
2) Is something wrong with writing to DFS (network, disks, etc)?
3) Are we running low on memory on that task manager?

Do you have telemetry information about used heap and GC pressure on the problematic task? However, what speaks against the memory problem hypothesis is that future checkpoints seem to go through again. What I find very strange is that within the reported 4 minutes of the async part the only thing that happens is: open the DFS output stream, iterate the in-memory state and write serialized state data to the DFS stream, then close the stream. There are no locks or waits in that section, so I would assume that for one of the three reasons I gave, writing the state is terribly slow.

Those snapshots should be able to run concurrently, for example so that users can also take savepoints even when a checkpoint was triggered and is still running. So there is no way to guarantee that the previous parts have finished; this is expected behaviour. Which waiting times are you missing in the log? I think the information about when a checkpoint is triggered, when it is received by the TM, the sync and async part durations, and the acknowledgement time should all be there.
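If the overlap itself turns out to be the problem (several async parts queuing behind each other and sharing one upload path), a hedged sketch of one knob to try, using the CheckpointConfig methods as exposed by the DataStream API:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hedged sketch: keep the 5 minute interval from the job, but allow only one
// checkpoint in flight so asynchronous parts cannot pile up behind each other.
public class CheckpointConcurrencySetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.enableCheckpointing(5 * 60 * 1000);
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(60 * 1000);

        // ... build and execute the job ...
    }
}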

Best,
Stefan



On 28.09.2017 at 08:18, Tony Wei <tony19920430@gmail.com> wrote:

Hi Stefan,

The checkpoint on my job has been subsumed again. There are some questions that I don't understand.

Log in JM:
2017-09-27 13:45:15,686 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 1576 (174693180 bytes in 21597 ms).
2017-09-27 13:49:42,795 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1577 @ 1506520182795
2017-09-27 13:54:42,795 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1578 @ 1506520482795
2017-09-27 13:55:13,105 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 1578 (152621410 bytes in 19109 ms).
2017-09-27 13:56:37,103 WARN org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late message for now expired checkpoint attempt 1577 from 2273da50f29b9dee731f7bd749e91c80 of job 7c039572b....
2017-09-27 13:59:42,795 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1579 @ 1506520782795

Log in TM:
2017-09-27 13:56:37,105 INFO org.apache.flink.runtime.state.DefaultOperatorStateBackend - DefaultOperatorStateBackend snapshot (File Stream Factory @ s3://tony-dev/flink-checkpoints/7c039572b13346f1b17dcc0ace2b72c2, asynchronous part) in thread Thread[pool-7-thread-322,5,Flink Task Threads] took 240248 ms.

I think the log in the TM might be the late message for #1577 in the JM, because #1576 and #1578 had already finished and #1579 hadn't been started at 13:56:37.
If I am not mistaken, I am wondering why the time it took was 240248 ms (4 min). It seems that it started later than the asynchronous tasks in #1578.
Is there any way to guarantee that the previous asynchronous parts of checkpoints will be executed before the following ones?

Moreover, I think it would be better to have more information in the INFO log, such as the waiting time and the checkpoint id, in order to trace the progress of a checkpoint conveniently.

What do you think? Do you have any suggestions for how I could deal with these problems? Thank you.

Best Regards,
Tony Wei

2017-09-27 17:11 GMT+08:00 Tony Wei <tony19920430@gmail.com>:
Hi Stefan,

Here is the summary of my streaming job's checkpoints after restarting last night.

<screenshot: 2017-09-27 4.56.30 PM.png>

This is the distribution of alignment buffered over the last 12 hours.

<screenshot: 2017-09-27 5.05.11 PM.png>

And here is the buffer out pool usage during chk #1140 ~ #1142. For chk #1245 and #1246, you can check the picture I sent before.

<screenshot: 2017-09-27 5.01.24 PM.png>

AFAIK, the back pressure rate is usually in LOW status, sometimes goes up to HIGH, and is always OK during the night.

Best Regards,
Tony Wei


2017-09-27 16:54 GMT+08:00 Stefan Richter <s.richter@data-artisans.com>:
Hi Tony,

are your checkpoints typically close to the timeout boundary? From what I see, writing the checkpoint is relatively fast, but the time from the checkpoint trigger to execution seems very long. This is typically the case if your job has a lot of backpressure and therefore the checkpoint barriers take a long time to travel to the operators, because a lot of events are piling up in the buffers. Do you also experience large alignments for your checkpoints?
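For completeness, a minimal sketch of the related timeout setting, again assuming the CheckpointConfig API of this Flink version. Raising the timeout does not fix back pressure, but it at least separates slow checkpoints from expired ones while the root cause is investigated:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hedged sketch: 5 minute interval as in the job, with a checkpoint timeout
// raised above the default 10 minutes so barriers delayed by back pressure do
// not immediately expire the attempt.
public class CheckpointTimeoutSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.enableCheckpointing(5 * 60 * 1000);
        env.getCheckpointConfig().setCheckpointTimeout(15 * 60 * 1000);

        // ... build and execute the job ...
    }
}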

Best,
Stefan  

On 27.09.2017 at 10:43, Tony Wei <tony19920430@gmail.com> wrote:

Hi Stefan,

It seems that I found something strange in the JM's log.

This had happened more than once before, but all subtasks would finish their checkpoint attempts in the end.

2017-09-26 01:23:28,690 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1140 @ 1506389008690
2017-09-26 01:28:28,690 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1141 @ 1506389308690
2017-09-26 01:33:28,690 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1142 @ 1506389608690
2017-09-26 01:33:28,691 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 1140 expired before completing.
2017-09-26 01:38:28,691 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 1141 expired before completing.
2017-09-26 01:40:38,044 WARN org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late message for now expired checkpoint attempt 1140 from c63825d15de0fef55a1d148adcf4467e of job 7c039572b...
2017-09-26 01:40:53,743 WARN org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late message for now expired checkpoint attempt 1141 from c63825d15de0fef55a1d148adcf4467e of job 7c039572b...
2017-09-26 01:41:19,332 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 1142 (136733704 bytes in 457413 ms).

For chk #1245 and #1246, there was no late message from the TM. You can refer to the TM log. A fully completed checkpoint attempt will have 12 (... asynchronous part) logs in general, but #1245 and #1246 only got 10 logs.

2017-09-26 10:08:28,690 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1245 @ 1506420508690
2017-09-26 10:13:28,690 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1246 @ 1506420808690
2017-09-26 10:18:28,691 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 1245 expired before completing.
2017-09-26 10:23:28,691 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 1246 expired before completing.

Moreover, I listed the directory for checkpoints on S3 and saw there were two states not discarded successfully. In general, there will be 16 parts for a completed checkpoint state.

2017-09-26 18:08:33 36919 tony-dev/flink-checkpoints/7c039572b13346f1b17dcc0ace2b72c2/chk-1245/eedd7ca5-ee34-45a5-bf0b-11cc1fc67ab8
2017-09-26 18:13:34 = 37419 tony-dev/flink-checkpoints/7c039572b13346f1b17dcc0ace2b72c2/chk-1246/9aa5c6c4-8c74-465d-8509-5fea4ed25af6

I hope this information is helpful. Thank you.

Best Regards,
Tony Wei

2017-09-27 16:14 GMT+08:00 Stefan Richter <s.richter@data-artisans.com>:
Hi,

thanks for the information. Unfortunately, I have no immediate idea what the reason is from the given information. I think most helpful could be a thread dump, but also metrics on the operator level to figure out which part of the pipeline is the culprit.
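If attaching jstack to the TaskManager JVM is inconvenient, a small, Flink-agnostic sketch using the standard java.lang.management API can produce a comparable dump (note that ThreadInfo.toString() truncates very deep stacks):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hedged sketch: dump all live threads with their stack traces and lock info.
public class ThreadDumpUtil {

    public static String dumpAllThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : bean.dumpAllThreads(true, true)) {
            sb.append(info.toString());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dumpAllThreads());
    }
}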

Best,
Stefan

On 26.09.2017 at 17:55, Tony Wei <tony19920430@gmail.com> wrote:

Hi Stefan,

There is no unknown exception in my full log. The Flink version is 1.3.2.
My job is roughly like this.

env.addSource(Kafka)
  .map(ParseKeyFromRecord)
  = .keyBy()
  .process(CountAndTimeoutWindow)
  .asyncIO(UploadToS3)
  .addSink(UpdateDatabase)

It seemed all tasks stopped like the picture I sent in the last email.

I will keep in mind to take a thread dump from that JVM if this happens again.

Best Regards,
Tony Wei

2017-09-26 23:46 GMT+08:00 Stefan Richter <s.richter@data-artisans.com>:
Hi,

that is very strange indeed. I had a look at the logs and there is no error or exception reported. I assume there is also no exception in your full logs? Which version of Flink are you using and what operators were running in the task that stopped? If this happens again, would it be possible to take a thread dump from that JVM?

Best,
Stefan

> On 26.09.2017 at 17:08, Tony Wei <tony19920430@gmail.com> wrote:
>
> Hi,
>
> Something weird happened on my streaming job.
>
> I found that my streaming job seemed to be blocked for a long time, and I saw the situation like the picture below. (chk #1245 and #1246 each had 7/8 tasks finished and were then marked as timed out by the JM. Other checkpoints failed in the same state, like #1247, until I restarted the TM.)
>
> <snapshot.png>
>
> I'm not sure what happened, but the consumer stopped fetching records, buffer usage was 100%, and the following task did not seem to fetch data anymore. It was as if the whole TM had stopped.
>
> However, after I restarted the TM and forced the job to restart from the latest completed checkpoint, everything worked again. And I don't know how to reproduce it.
>
> The attachment is my TM log. Because there are many user logs and sensitive information, I only kept the log lines from `org.apache.flink...`.
>
> My cluster setting is one JM and one TM with 4 available slots.
>
> The streaming job uses all slots, the checkpoint interval is 5 mins, and the max concurrent number is 3.
>
> Please let me know if you need more information to find out what happened to my streaming job. Thanks for your help.
>
> Best Regards,
> Tony Wei
> <flink-root-taskmanager-0-partial.log>











<chk_1577.log>