flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Yan Zhou [FDS Science]" <yz...@coupang.com>
Subject checkpoint/taskmanager is stuck, deadlock on LocalBufferPool
Date Mon, 22 Oct 2018 18:26:43 GMT
Hi,
My application suddenly stuck and completely doesn't move forward after running for a few
days. No exceptions are found. From the thread dump, I can see that the operator threads and
checkpoint threads deadlock on LocalBufferPool. LocalBufferPool is not able to request memory
and keep the lock. Please see the thread dump at the bottom.

It uses rocksdb as statebackend. From the heap dump and web ui, there are plenty of memory
in jvm and it doesn't have GC problem. Check points were good until there was the problem:

2018-10-19 04:41:23,691 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  
  - Triggering checkpoint 4347 @ 1539891683667 for job 1.
2018-10-19 04:41:45,069 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  
  - Completed checkpoint 4347 for job 1 (1019729450 bytes in 13600 ms).
2018-10-19 04:46:45,089 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  
  - Triggering checkpoint 4348 @ 1539892005069 for job 1.
2018-10-19 04:56:45,089 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  
  - Checkpoint 4348 of job 1 expired before completing.


This happened at mid night and the traffic was relatively low. Even if there was a spike and
caused a back pressure, to my understand that the events should be processed  eventually and
the network buffer would be available after that. What might be the cause of it?


Best
Yan

"Time Trigger for Source: Custom Source -> Flat Map -> Flat Map -> Timestamps/Watermarks
-> (from: ...s#363 daemon prio=5 os_prio=0 tid=0x00007ff187944000 nid=0x8f76 in Object.wait()
[0x00007ff12fda9000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegment(LocalBufferPool.java:247)
- locked <0x00000006dadeeac8> (a java.util.ArrayDeque)
at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:204)
at org.apache.flink.runtime.io.network.api.writer.RecordWriter.requestNewBufferBuilder(RecordWriter.java:213)
at org.apache.flink.runtime.io.network.api.writer.RecordWriter.sendToTarget(RecordWriter.java:144)
at org.apache.flink.runtime.io.network.api.writer.RecordWriter.broadcastEmit(RecordWriter.java:117)
at org.apache.flink.streaming.runtime.io.StreamRecordWriter.broadcastEmit(StreamRecordWriter.java:87)
at org.apache.flink.streaming.runtime.io.RecordWriterOutput.emitWatermark(RecordWriterOutput.java:121)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitWatermark(AbstractStreamOperator.java:668)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:736)
at org.apache.flink.streaming.api.operators.ProcessOperator.processWatermark(ProcessOperator.java:72)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitWatermark(OperatorChain.java:479)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitWatermark(AbstractStreamOperator.java:668)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:736)
at org.apache.flink.streaming.api.operators.ProcessOperator.processWatermark(ProcessOperator.java:72)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitWatermark(OperatorChain.java:479)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$BroadcastingOutputCollector.emitWatermark(OperatorChain.java:603)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitWatermark(AbstractStreamOperator.java:668)
at org.apache.flink.streaming.runtime.operators.TimestampsAndPeriodicWatermarksOperator.onProcessingTime(TimestampsAndPeriodicWatermarksOperator.java:77)
at org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService$TriggerTask.run(SystemProcessingTimeService.java:285)
- locked <0x00000006dcb3db50> (a java.lang.Object)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
- <0x00000006eac38310> (a java.util.concurrent.ThreadPoolExecutor$Worker)


"Source: Custom Source -> Flat Map -> Flat Map -> Timestamps/Watermarks -> (from:
...s#116 prio=5 os_prio=0 tid=0x00007ff1a791a000 nid=0x78c9 waiting for monitor entry [0x00007ff14f6da000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecordWithTimestamp(AbstractFetcher.java:397)
- waiting to lock <0x00000006dcb3db50> (a java.lang.Object)
at org.apache.flink.streaming.connectors.kafka.internal.Kafka010Fetcher.emitRecord(Kafka010Fetcher.java:89)
at org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.runFetchLoop(Kafka09Fetcher.java:154)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:738)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:56)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:99)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:306)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:712)
at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
- None


"Async calls on Source: Custom Source -> Flat Map -> Flat Map -> Timestamps/Watermarks
-> (from: (... sel#409 daemon prio=5 os_prio=0 tid=0x00007ff1682fc800 nid=0x943c waiting
for monitor entry [0x00007ff0dae3a000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:620)
- waiting to lock <0x00000006dcb3db50> (a java.lang.Object)
at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:564)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.triggerCheckpoint(SourceStreamTask.java:116)
at org.apache.flink.runtime.taskmanager.Task$1.run(Task.java:1229)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
- <0x00000006ee7095f0> (a java.util.concurrent.ThreadPoolExecutor$Worker)

Mime
View raw message