flink-user mailing list archives

From Ravi Bhushan Ratnakar <ravibhushanratna...@gmail.com>
Subject Re: Checkpointing is not performing well
Date Sat, 07 Sep 2019 21:38:05 GMT
Hi Rafi,

Thank you for your quick response.

I have tested with the RocksDB state backend. RocksDB required significantly
more task managers to keep up compared to the filesystem state backend. The
problem here is that the checkpoint process is not fast enough to complete.

Our requirement is to checkpoint as quickly as possible, within about 5
seconds, so that the output is flushed to the sink. Because the incoming data
rate is high, checkpoints cannot complete that quickly. If I increase the
checkpoint interval, the state size grows much faster and checkpointing takes
much longer to complete. I also tried AT_LEAST_ONCE mode, but it does not
improve things much. Adding more task managers to increase parallelism does
not improve checkpointing performance either.

Is it possible to achieve checkpoints as short as 5 seconds with such a
high input volume?
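
As a rough sanity check on that 5-second target, here is a back-of-the-envelope
sketch in Python using the figures from my original message quoted below. It
is only an estimate under assumptions that are mine and not measured: decimal
units (1 GB = 1000 MB), a full (non-incremental) snapshot, and state spread
evenly across all task managers. The function names are made up for
illustration.

```python
# Back-of-the-envelope sizing with the figures quoted in this thread.
# Assumptions (not measured): decimal units (1 GB = 1000 MB), a full
# (non-incremental) snapshot, and state spread evenly across task managers.

STATE_GB = 50                 # state size at which checkpoints still completed
TASK_MANAGERS = 200           # r4.2xlarge task managers
INGEST_GB_PER_MIN = (10, 21)  # incoming Kinesis rate, low and high


def per_tm_upload_mb_s(state_gb, task_managers, window_s):
    """MB/s each task manager must write to finish a full snapshot in window_s."""
    return state_gb * 1000 / task_managers / window_s


def ingest_gb(rate_gb_per_min, window_s):
    """GB of new input that arrives during a window of window_s seconds."""
    return rate_gb_per_min * window_s / 60


if __name__ == "__main__":
    for window_s in (5, 60):
        upload = per_tm_upload_mb_s(STATE_GB, TASK_MANAGERS, window_s)
        low, high = (ingest_gb(r, window_s) for r in INGEST_GB_PER_MIN)
        print(f"{window_s}s window: {upload:.1f} MB/s per task manager, "
              f"{low:.2f} to {high:.2f} GB of new input")
```

Under these assumptions, writing a full 50 GB snapshot in 5 seconds needs about
50 MB/s of sustained upload from every task manager. The sketch ignores S3
request overhead and barrier alignment, so it is a lower bound on the effort
and not a prediction.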


On Sat 7 Sep, 2019, 22:25 Rafi Aroch, <rafi.aroch@gmail.com> wrote:

> Hi Ravi,
> Consider moving to the RocksDB state backend, where you can enable
> incremental checkpointing. This will make your checkpoint size stay pretty
> much constant even when your state becomes larger.
> https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/state/state_backends.html#the-rocksdbstatebackend
> Thanks,
> Rafi
> On Sat, Sep 7, 2019, 17:47 Ravi Bhushan Ratnakar <ravibhushanratnakar@gmail.com> wrote:
>> Hi All,
>> I am writing a streaming application using Flink 1.9. The application
>> consumes data from a Kinesis stream whose payload is Avro. It uses a
>> KeyedProcessFunction to execute business logic keyed by correlation ID,
>> using the event-time characteristic, with the configuration below:
>> State backend - filesystem with S3 storage
>> registerEventTimeTimer timestamp for each key - currentWatermark + 15 seconds
>> Checkpoint interval - 1 min
>> Min pause between checkpoints - 1 min
>> Checkpoint timeout - 10 min
>> Incoming data rate from Kinesis - ~10 to 21 GB/min
>> Number of task managers - 200 (r4.2xlarge -> 8 CPU, 61 GB)
>> The first 2-4 checkpoints complete within 1 min, while the state size is
>> usually 50 GB. Once the state grows beyond 50 GB, checkpointing takes
>> longer than 1 min, and the time keeps increasing up to 10 min, at which
>> point the checkpoint fails. As soon as a checkpoint takes more than 1 min
>> to complete, the application slows down and its output starts lagging.
>> Any suggestions for fine-tuning checkpoint performance would be highly
>> appreciated.
>> Regards,
>> Ravi
