flink-user mailing list archives

From qi luo <luoqi...@gmail.com>
Subject Re: Flink Exception - assigned slot container was removed
Date Mon, 26 Nov 2018 11:19:47 GMT
This is weird. Could you paste your entire exception trace here?

> On Nov 26, 2018, at 4:37 PM, Flink Developer <developer143@protonmail.com> wrote:
> 
> In addition, after the Flink job has failed with the above exception, it is unable to recover from the previous checkpoint. Is this the expected behavior? How can the job be recovered successfully from this?
> 
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Monday, November 26, 2018 12:30 AM, Flink Developer <developer143@protonmail.com> wrote:
> 
>> Thanks for the suggestion, Qi. I tried increasing slot.idle.timeout to 3600000, but the job still hit the issue. Does this mean that if a slot or "Flink worker" has not processed items for 1 hour, it will be removed?
>> 
>> Would any other Flink configuration properties help with this (see the sketch after this list)?
>> 
>> slot.request.timeout
>> web.timeout
>> heartbeat.interval
>> heartbeat.timeout
>> 
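For reference, a minimal flink-conf.yaml sketch combining the timeouts discussed in this thread; the values are illustrative assumptions, not recommendations:

    # flink-conf.yaml (illustrative values only)
    slot.idle.timeout: 3600000       # ms before an idle slot is released (default 50000)
    slot.request.timeout: 600000     # ms to wait for a slot request to be fulfilled
    web.timeout: 600000              # ms for web/REST asynchronous operations
    heartbeat.interval: 10000        # ms between JobManager/TaskManager heartbeats
    heartbeat.timeout: 180000        # ms without a heartbeat before a peer is considered lost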
>> 
>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>> On Sunday, November 25, 2018 6:56 PM, 罗齐 <luoqi.bd@bytedance.com> wrote:
>> 
>>> Hi,
>>> 
>>> It looks like some of your slots were freed during job execution (possibly because they were idle for too long). AFAIK the exception was thrown when a pending slot request was removed. You can try increasing "slot.idle.timeout" to mitigate this issue (the default is 50000 ms; try 3600000 or higher).
>>> 
>>> Regards,
>>> Qi
>>> 
>>>> On Nov 26, 2018, at 7:36 AM, Flink Developer <developer143@protonmail.com> wrote:
>>>> 
>>>> Hi, I have a Flink application sourcing from a Kafka topic (400 partitions) and sinking to S3 with BucketingSink, using RocksDB for state and checkpointing every 2 minutes. The Flink app runs with parallelism 400 so that each worker handles one partition. This is on Flink 1.5.2. The Flink cluster uses 10 task managers with 40 slots each.
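For context, a rough sketch of the kind of job described above, against the Flink 1.5.x APIs; the topic name, Kafka properties, and S3 paths are assumptions, not from the original post:

    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

    public class KafkaToS3Job {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // One slot per Kafka partition: 400 partitions -> parallelism 400.
            env.setParallelism(400);

            // RocksDB state backend with checkpoints every 2 minutes.
            env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/checkpoints"));
            env.enableCheckpointing(2 * 60 * 1000);

            Properties kafkaProps = new Properties();
            kafkaProps.setProperty("bootstrap.servers", "kafka:9092");
            kafkaProps.setProperty("group.id", "flink-s3-writer");

            env.addSource(new FlinkKafkaConsumer011<>("my-topic", new SimpleStringSchema(), kafkaProps))
               .addSink(new BucketingSink<String>("s3://my-bucket/output"));

            env.execute("Kafka -> S3 via BucketingSink");
        }
    }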
>>>> 
>>>> After running for a few days straight, it encounters a Flink exception:
>>>> org.apache.flink.util.FlinkException: The assigned slot container_1234567_0003_01_000009_1 was removed.
>>>> 
>>>> This causes the Flink job to fail, and I am unsure what causes it. During this time I also see some checkpoints stating "checkpoint was declined (tasks not ready)". At that point, the job is unable to recover and fails. Does this happen if a slot or worker does no processing for X amount of time? Would I need to increase the following Flink config properties when creating the Flink cluster in YARN?
>>>> 
>>>> slot.idle.timeout
>>>> slot.request.timeout
>>>> web.timeout
>>>> heartbeat.interval
>>>> heartbeat.timeout
>>>> 
>>>> Any help would be greatly appreciated.
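One way to apply such settings when starting the cluster on YARN is via dynamic properties on yarn-session.sh, which override flink-conf.yaml; the -n/-s values below mirror the 10 task managers with 40 slots each mentioned above, and the timeout values are illustrative:

    ./bin/yarn-session.sh -n 10 -s 40 \
        -Dslot.idle.timeout=3600000 \
        -Dheartbeat.timeout=180000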
>>>> 
>> 
> 

