flink-user mailing list archives

From Juho Autio <juho.au...@rovio.com>
Subject Re: 1.5.1
Date Mon, 13 Aug 2018 07:52:04 GMT
I also have jobs failing on a daily basis with the error "Heartbeat of
TaskManager with id <id> timed out". I'm using Flink 1.5.2.

Could anyone suggest how to debug possible causes?

I already set these in flink-conf.yaml, but I'm still getting failures:
heartbeat.interval: 10000
heartbeat.timeout: 100000
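
For reference, a minimal sketch of what these two settings amount to, assuming
both values are in milliseconds and that the defaults are 10000 and 50000 (the
50s default timeout is mentioned further down in this thread; the default
interval is an assumption). The helper names parse_flink_conf and
heartbeat_summary are hypothetical and not part of Flink:

```python
# Sketch only: read heartbeat settings from flink-conf.yaml-style text
# and report them in seconds. Assumes values are in milliseconds.

def parse_flink_conf(text):
    """Parse simple 'key: value' lines, skipping blanks and comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            conf[key.strip()] = value.strip()
    return conf


def heartbeat_summary(conf, default_interval_ms=10000, default_timeout_ms=50000):
    """Return interval and timeout in seconds (defaults are assumptions)."""
    interval_ms = int(conf.get("heartbeat.interval", default_interval_ms))
    timeout_ms = int(conf.get("heartbeat.timeout", default_timeout_ms))
    if timeout_ms <= interval_ms:
        raise ValueError("heartbeat.timeout should exceed heartbeat.interval")
    return {
        "interval_s": interval_ms / 1000,
        "timeout_s": timeout_ms / 1000,
        # How many heartbeat intervals fit into one timeout window.
        "intervals_per_timeout": timeout_ms // interval_ms,
    }


CONF_TEXT = """
heartbeat.interval: 10000
heartbeat.timeout: 100000
"""

summary = heartbeat_summary(parse_flink_conf(CONF_TEXT))
print(summary)
```

With the values above, this reports a 10 s interval and a 100 s timeout, so
the timeout is already twice the 50s default.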

Thanks.

On Sun, Jul 22, 2018 at 2:20 PM Vishal Santoshi <vishal.santoshi@gmail.com>
wrote:

> According to the UI it seems that "
>
> org.apache.flink.util.FlinkException: The assigned slot 208af709ef7be2d2dfc028ba3bbf4600_10 was removed.
>
> " was the cause of a pipe restart.
>
> As to the TM, it is an artifact of the new job allocation regime, which will
> exhaust all slots on a TM rather than distributing them equitably. Some TMs
> are under more stress than in a pure RR distribution, I think.
> We may have to lower the slots on each TM to define a good upper bound. You
> are correct, 50s is a pretty generous value.
>
> On Sun, Jul 22, 2018 at 6:55 AM, Gary Yao <gary@data-artisans.com> wrote:
>
>> Hi,
>>
>> The first exception should only be logged at info level. It's expected to
>> see this exception when a TaskManager unregisters from the ResourceManager.
>>
>> Heartbeats can be configured via heartbeat.interval and heartbeat.timeout
>> [1]. The default timeout is 50s, which should be a generous value. It is
>> probably a good idea to find out why the heartbeats cannot be answered by
>> the TM.
>>
>> Best,
>> Gary
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/config.html#heartbeat-manager
>>
>>
>> On Sun, Jul 22, 2018 at 1:36 AM, Vishal Santoshi <
>> vishal.santoshi@gmail.com> wrote:
>>
>>> We are seeing 2 issues on 1.5.1 on a streaming pipeline:
>>>
>>> org.apache.flink.util.FlinkException: The assigned slot 208af709ef7be2d2dfc028ba3bbf4600_10 was removed.
>>>
>>>
>>> and
>>>
>>> java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id 208af709ef7be2d2dfc028ba3bbf4600 timed out.
>>>
>>>
>>> Not sure about the first, but how do we increase the heartbeat interval
>>> of a TM?
>>>
>>> Thanks much
>>>
>>> Vishal
>>>
>>
>>
>
