flink-user mailing list archives

From Stefan Richter <s.rich...@data-artisans.com>
Subject Re: Stream Task seems to be blocked after checkpoint timeout
Date Wed, 27 Sep 2017 08:14:52 GMT
Hi,

thanks for the information. Unfortunately, I have no immediate idea what the reason is from
the given information. I think a thread dump would be most helpful, but metrics at the
operator level could also help to figure out which part of the pipeline is the culprit.

Best,
Stefan
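
[Editor's note, not part of the original thread: the usual way to take the thread dump Stefan asks for is `jstack <pid>` or `kill -3 <pid>` (which prints the dump to the JVM's stdout). A dump can also be captured programmatically from inside the JVM via the standard JMX `ThreadMXBean`; the sketch below shows this, with a made-up class name for illustration.]

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {

    // Capture the stack traces of all live threads in this JVM,
    // including lock and monitor information.
    // Note: ThreadInfo.toString() truncates each stack to 8 frames;
    // for a full dump, prefer `jstack <pid>` on the TaskManager process.
    static String dumpAllThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : bean.dumpAllThreads(true, true)) {
            sb.append(info.toString());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dumpAllThreads());
    }
}
```

A dump taken this way (or with `jstack`) would show whether the stalled task threads are blocked requesting network buffers, waiting on a checkpoint lock, or stuck in user code.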

> Am 26.09.2017 um 17:55 schrieb Tony Wei <tony19920430@gmail.com>:
> 
> Hi Stefan,
> 
> There is no unknown exception in my full log. The Flink version is 1.3.2.
> My job is roughly like this.
> 
> env.addSource(Kafka)
>   .map(ParseKeyFromRecord)
>   .keyBy()
>   .process(CountAndTimeoutWindow)
>   .asyncIO(UploadToS3)
>   .addSink(UpdateDatabase)
> 
> It seemed all tasks had stopped, as in the picture I sent in the last email.
> 
> I will make sure to take a thread dump from that JVM if this happens again.
> 
> Best Regards,
> Tony Wei
> 
> 2017-09-26 23:46 GMT+08:00 Stefan Richter <s.richter@data-artisans.com>:
> Hi,
> 
> that is very strange indeed. I had a look at the logs and there is no error or exception
> reported. I assume there is also no exception in your full logs? Which version of Flink
> are you using, and what operators were running in the task that stopped? If this happens
> again, would it be possible to take a thread dump from that JVM?
> 
> Best,
> Stefan
> 
> > Am 26.09.2017 um 17:08 schrieb Tony Wei <tony19920430@gmail.com>:
> >
> > Hi,
> >
> > Something weird happened on my streaming job.
> >
> > I found my streaming job seems to be blocked for a long time, and I saw the situation
> > shown in the picture below. (chk #1245 and #1246 both finished 7/8 tasks and were then
> > marked as timed out by the JM. Subsequent checkpoints failed in the same state, such as
> > #1247, until I restarted the TM.)
> >
> > <snapshot.png>
> >
> > I'm not sure what happened, but the consumer stopped fetching records, buffer usage
> > was 100%, and the downstream task did not seem to fetch data anymore. It was as if the
> > whole TM had stopped.
> >
> > However, after I restarted the TM and forced the job to restart from the latest
> > completed checkpoint, everything worked again. I don't know how to reproduce it.
> >
> > The attachment is my TM log. Because it contains many user logs and sensitive
> > information, I only kept the log lines from `org.apache.flink...`.
> >
> > My cluster setting is one JM and one TM with 4 available slots.
> >
> > The streaming job uses all slots, the checkpoint interval is 5 minutes, and the max
> > number of concurrent checkpoints is 3.
> >
> > Please let me know if you need more information to find out what happened to my
> > streaming job. Thanks for your help.
> >
> > Best Regards,
> > Tony Wei
> > <flink-root-taskmanager-0-partial.log>
> 
> 

