flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Flink Developer <developer...@protonmail.com>
Subject Flink Exception - assigned slot container was removed
Date Sun, 25 Nov 2018 23:36:18 GMT
Hi, I have a Flink application sourcing from a topic in Kafka (400 partitions) and sinking
to S3 using bucketingsink and using RocksDb for checkpointing every 2 mins. The Flink app
runs with parallelism 400 so that each worker handles a partition. This is using Flink 1.5.2.
The Flink cluster uses 10 task managers with 40 slots each.

After running for a few days straight, it encounters a Flink exception:
Org.apache.flink.util.FlinkException: The assigned slot container_1234567_0003_01_000009_1
was removed.

This causes the Flink job to fail. It is odd to me. I am unsure what causes this. Also, during
this time, I see some checkpoints stating "checkpoint was declined (tasks not ready)". At
this point, the job is unable to recover and fails. Does this happen if a slot or worker is
not doing processing for X amount of time? Would I need to increase the Flink config properties
for the following when creating the Flink cluster in yarn?

Slot.idle.timeout
Slot.request.timeout
Web.timeout
Heartbeat.interval
Heartbeat.timeout

Any help would be greatly appreciated.
Mime
View raw message