flink-issues mailing list archives

From "Till Rohrmann (JIRA)" <j...@apache.org>
Subject [jira] [Closed] (FLINK-9635) Local recovery scheduling can cause spread out of tasks
Date Thu, 01 Nov 2018 10:37:00 GMT

     [ https://issues.apache.org/jira/browse/FLINK-9635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Rohrmann closed FLINK-9635.
       Resolution: Fixed
    Fix Version/s: 1.6.3

Fixed in 1.6.3 via https://github.com/apache/flink/commit/04df02b4728d40b59417ccc8ee281ab3298b09da

> Local recovery scheduling can cause spread out of tasks
> -------------------------------------------------------
>                 Key: FLINK-9635
>                 URL: https://issues.apache.org/jira/browse/FLINK-9635
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0, 1.6.2
>            Reporter: Till Rohrmann
>            Assignee: Stefan Richter
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.6.3, 1.7.0
> In order to make local recovery work, Flink's scheduling was changed such that a task tries
to be rescheduled to its previous location. To avoid occupying slots that hold cached state
of other tasks, the strategy requests a new slot if the old slot identified by
the previous allocation id is no longer present. This also applies to newly allocated slots,
because the strategy does not distinguish between new and already used slots. As a result,
every task can end up deployed to its own slot, for example if the {{SlotPool}} has released
all slots in the meantime. The consequence could be that a job can no longer be executed after
a failure because it needs more slots than before.
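
The effect described above can be illustrated with a small toy model. This is only a sketch and not Flink's scheduler code: ToySlotPool, reschedule, simulate and the task names are hypothetical, and the initial deployment (two tasks sharing each of two slots) is an assumption made for the example. Each task reuses its previous slot if that allocation id is still present and otherwise requests a new slot, without ever joining a slot that another task has just allocated.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these names are Flink APIs.
class LocalRecoverySketch {

    /** Hypothetical slot pool: allocation id -> tasks deployed into that slot. */
    static final class ToySlotPool {
        private final Map<String, List<String>> slots = new LinkedHashMap<>();
        private int nextId = 0;

        String allocateNewSlot() {
            String id = "alloc-" + nextId++;
            slots.put(id, new ArrayList<>());
            return id;
        }

        boolean contains(String allocationId) {
            return slots.containsKey(allocationId);
        }

        void deploy(String allocationId, String task) {
            slots.get(allocationId).add(task);
        }

        void releaseAll() {
            slots.clear();
        }

        int size() {
            return slots.size();
        }
    }

    /**
     * Reschedules every task: reuse the previous slot if it still exists,
     * otherwise request a brand-new slot. Slots allocated during this pass are
     * never shared, because the modelled strategy cannot tell a fresh slot
     * apart from one that caches another task's state.
     */
    static void reschedule(ToySlotPool pool, Map<String, String> previousAllocation) {
        for (Map.Entry<String, String> entry : previousAllocation.entrySet()) {
            String previous = entry.getValue();
            String target = pool.contains(previous) ? previous : pool.allocateNewSlot();
            pool.deploy(target, entry.getKey());
        }
    }

    /** Returns the number of slots the job occupies after recovery. */
    static int simulate(boolean releaseSlotsBeforeRecovery) {
        ToySlotPool pool = new ToySlotPool();
        String slot0 = pool.allocateNewSlot();
        String slot1 = pool.allocateNewSlot();

        // Assumed initial deployment: two tasks share each slot.
        Map<String, String> previousAllocation = new LinkedHashMap<>();
        previousAllocation.put("source-0", slot0);
        previousAllocation.put("sink-0", slot0);
        previousAllocation.put("source-1", slot1);
        previousAllocation.put("sink-1", slot1);

        if (releaseSlotsBeforeRecovery) {
            pool.releaseAll(); // models the SlotPool having released all slots
        }
        reschedule(pool, previousAllocation);
        return pool.size();
    }

    public static void main(String[] args) {
        System.out.println("slots kept:     " + simulate(false));
        System.out.println("slots released: " + simulate(true));
    }
}
```

In this model the job occupies two slots when the previous slots are still present, and four slots once the pool has released them, which is the spread-out behaviour the issue reports.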

This message was sent by Atlassian JIRA
