flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Flink Developer <developer...@protonmail.com>
Subject Flink Exception - assigned slot container was removed
Date Sun, 25 Nov 2018 23:36:18 GMT
Hi, I have a Flink application sourcing from a topic in Kafka (400 partitions) and sinking
to S3 using bucketingsink and using RocksDb for checkpointing every 2 mins. The Flink app
runs with parallelism 400 so that each worker handles a partition. This is using Flink 1.5.2.
The Flink cluster uses 10 task managers with 40 slots each.

After running for a few days straight, it encounters a Flink exception:
Org.apache.flink.util.FlinkException: The assigned slot container_1234567_0003_01_000009_1
was removed.

This causes the Flink job to fail. It is odd to me. I am unsure what causes this. Also, during
this time, I see some checkpoints stating "checkpoint was declined (tasks not ready)". At
this point, the job is unable to recover and fails. Does this happen if a slot or worker is
not doing processing for X amount of time? Would I need to increase the Flink config properties
for the following when creating the Flink cluster in yarn?


Any help would be greatly appreciated.
View raw message