flink-user mailing list archives

From Yangze Guo <karma...@gmail.com>
Subject Re: TaskManager crash after cancelling a job
Date Tue, 27 Jul 2021 02:22:49 GMT
Hi, Ivan

My gut feeling is that it is related to FLINK-22535. Could @Yun Gao
take another look? If that is the case, you can upgrade to 1.13.1.

Yangze Guo

On Tue, Jul 27, 2021 at 9:41 AM Ivan Yang <ivanygyang@gmail.com> wrote:
> Dear Flink experts,
> We recently ran into an issue with job cancellation after upgrading to 1.13. After we issue a cancel (from the Flink console or with "flink cancel {jobid}"), a few subtasks get stuck in the CANCELLING state. Once the job gets into that situation, the behavior is consistent: those "cancelling" tasks never become CANCELED. After 3 minutes the job stops and, as a result, a number of TaskManagers are lost. It takes about another 5 minutes for those lost TaskManagers to rejoin the JobManager, after which we can restart the job from the previous checkpoint. We found this exception in the hanging (cancelling) TaskManager:
> ==========================================
>         sun.misc.Unsafe.park(Native Method)
          java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
          java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
          java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
          java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
          java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
          org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInvoke(StreamTask.java:705)
          org.apache.flink.streaming.runtime.tasks.SourceStreamTask.cleanUpInvoke(SourceStreamTask.java:186)
          org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:637)
          org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:776)
          org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
          java.lang.Thread.run(Thread.java:748)
> ===========================================
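> Reading the trace, the task thread is parked in CompletableFuture.join(). As far as we understand, join() (unlike get()) does not abort when the thread is interrupted, so the interrupts Flink sends during cancellation cannot free it; only completing the future can. A minimal, self-contained sketch (our own illustration, not Flink code) that reproduces the same frames:

    import java.util.concurrent.CompletableFuture;

    public class JoinHangDemo {
        public static void main(String[] args) throws Exception {
            // Stand-in for the cleanup future that StreamTask.cleanUpInvoke()
            // joins on; here it is simply never completed.
            CompletableFuture<Void> cleanup = new CompletableFuture<>();

            Thread task = new Thread(cleanup::join, "task-thread");
            task.start();

            Thread.sleep(1000);
            task.interrupt();        // what cancellation does to the task thread
            task.join(2000);         // give it a chance to exit
            System.out.println("alive after interrupt: " + task.isAlive()); // prints true

            cleanup.complete(null);  // only completing the future unparks the thread
            task.join();
            System.out.println("exited after the future completed");
        }
    }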
> Here is some background information about our job and setup:
> 1) The job is relatively large: we run with 500+ parallelism and 2000+ subtasks. It mainly reads from a Kinesis stream, performs some transformations, and fans out to multiple output S3 buckets. It is a stateless ETL job.
> 2) The same code and setup running in smaller environments do not seem to have this cancellation failure.
> 3) We had been running the same job on Flink 1.11.0 and never saw this cancellation failure or the resulting loss of TaskManagers.
> 4) Along with upgrading to 1.13, we also enabled Kubernetes HA (ZooKeeper-less); previously we did not use HA. (Our HA configuration is sketched below.)
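> For reference, our flink-conf.yaml follows the documented 1.13 pattern for ZooKeeper-less Kubernetes HA, roughly as below; the cluster id and storage path shown are placeholders, not our actual values:

    kubernetes.cluster-id: my-flink-cluster    # placeholder id
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    high-availability.storageDir: s3://my-bucket/flink-ha/   # placeholder; any durable filesystem works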
> Cancelling and restarting from the previous checkpoint is our regular procedure for daily operation, so this roughly 10-minute TaskManager restart cycle noticeably slows our throughput. I am trying to understand what leads to this situation, hoping that some configuration change will smooth things out. Any suggestion for shortening the wait is also welcome; it appears some timeouts on the TaskManager and JobManager could be adjusted to speed it up (the settings we are looking at are listed below), but we would really like to avoid getting stuck in cancellation in the first place.
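> These flink-conf.yaml settings seem to govern the timeline described above; the values shown are, as far as we can tell, the 1.13 defaults, meant for tuning rather than copying:

    task.cancellation.interval: 30000   # ms between successive interrupt attempts on a cancelling task
    task.cancellation.timeout: 180000   # ms before a stuck cancellation fatally kills the TaskManager
                                        # (matches the 3 minutes we observe; 0 disables this watchdog)
    heartbeat.interval: 10000           # ms between JobManager/TaskManager heartbeats
    heartbeat.timeout: 50000            # ms of silence before a TaskManager is declared lost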
> Thank you; hoping to get some insight here.
> Ivan
