flink-user mailing list archives

From Paul Lam <paullin3...@gmail.com>
Subject Flink performance drops when async checkpoint is slow
Date Thu, 28 Feb 2019 07:17:16 GMT
Hi,

I have a Flink job (version 1.5.3) that consumes from a Kafka topic, applies some transformations
and aggregations, and writes the results to two Kafka topics. In addition, a custom
source periodically pulls configuration for the transformations. The general job graph
is shown below.



The job uses FsStateBackend and checkpoints to HDFS, but the HDFS load is unstable, and sometimes
the HDFS client reports slow reads and slow waitForAckedSeqno during checkpoints. When that happens,
the job's consumption rate drops significantly, the CPU usage of some taskmanagers drops from
about 140% to 1%, and all the task threads on those taskmanagers are blocked. This situation lasts
from a few seconds to a minute. We started a parallel job that is identical except that checkpointing
is disabled, and it runs very steadily.
However, since checkpointing is asynchronous, I would expect it not to affect the task threads.

Here is some additional information that we observed:

- When the performance drops, jstack shows that the Kafka source and the task right after it
are blocked requesting memory buffers (with back pressure close to 1), and the last task
is blocked at `SingleInputGate.getNextBufferOrEvent`.
- The dashboard shows that the data buffered during alignment is less than 10 MB, even when back
pressure is high.
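
For illustration, the blocking pattern in the first observation can be modeled with a bounded queue: once the downstream side stops taking buffers, the upstream side can emit only until the queue is full. This is a toy sketch in plain Java, not Flink's network stack, and it does not explain why the downstream task stalls in this job. `BackPressureSketch` and `emitUntilBlocked` are hypothetical names, and the capacity of 4 is arbitrary.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BackPressureSketch {

    /**
     * Tries to emit up to `attempts` records into a bounded queue that stands in
     * for a fixed pool of network buffers (an assumption of this toy model).
     * Returns how many records were accepted before the queue was full.
     */
    public static int emitUntilBlocked(BlockingQueue<Integer> buffers, int attempts) {
        int accepted = 0;
        for (int i = 0; i < attempts; i++) {
            // Non-blocking offer: false means no free buffer is available.
            // A real task thread would park here instead of returning.
            if (!buffers.offer(i)) {
                break;
            }
            accepted++;
        }
        return accepted;
    }

    public static void main(String[] args) {
        BlockingQueue<Integer> buffers = new ArrayBlockingQueue<>(4);

        // Downstream is stalled (nothing is taken from the queue).
        int whileStalled = emitUntilBlocked(buffers, 100);
        System.out.println("accepted while downstream is stalled: " + whileStalled);

        // Downstream frees exactly one buffer.
        buffers.poll();
        int afterOneFreed = emitUntilBlocked(buffers, 100);
        System.out.println("accepted after one buffer is freed: " + afterOneFreed);
    }
}
```

In this model the upstream side makes progress only as fast as the downstream side frees buffers, which matches the thread dumps in shape: the source waits for a free buffer while the last task waits for input.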

We’ve been struggling with this problem for weeks, and any help is appreciated. Thanks a
lot!

Best,
Paul Lam

