flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-9635) Local recovery scheduling can cause spread out of tasks
Date Tue, 30 Oct 2018 10:56:02 GMT

    [ https://issues.apache.org/jira/browse/FLINK-9635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16668533#comment-16668533 ]

ASF GitHub Bot commented on FLINK-9635:
---------------------------------------

tillrohrmann commented on a change in pull request #6961: [FLINK-9635] Fix scheduling for
local recovery
URL: https://github.com/apache/flink/pull/6961#discussion_r229253693
 
 

 ##########
 File path: flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/slotpool/PreviousAllocationSchedulingStrategy.java
 ##########
 @@ -48,35 +50,68 @@ private PreviousAllocationSchedulingStrategy() {}
 	@Override
 	public <IN, OUT> OUT findMatchWithLocality(
 			@Nonnull SlotProfile slotProfile,
-			@Nonnull Stream<IN> candidates,
-			@Nonnull Function<IN, SlotContext> contextExtractor,
+			@Nonnull Supplier<Stream<IN>> candidates,
 
 Review comment:
   I am not sure whether `Supplier<Stream<IN>>` is the best construct for iterating
over a collection of `IN` multiple times. I think either `Collection` or `Iterable`
would serve the purpose better here.
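
For illustration only (this sketch is not part of the original review comment or of the pull request): a `Stream` can be traversed only once, so a method that needs two passes must either receive a factory for fresh streams or a re-iterable container. The class and method names below (`MultiPassExample`, `findWithSupplier`, `findWithCollection`) are hypothetical, and the "preferred first, then any" matching is an assumed stand-in for the real slot-matching logic.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Stream;

final class MultiPassExample {

    // Variant 1: the caller supplies a factory; each pass asks for a fresh Stream.
    static String findWithSupplier(Supplier<Stream<String>> candidates, String preferred) {
        // First pass: look for the preferred candidate.
        String match = candidates.get().filter(c -> c.equals(preferred)).findFirst().orElse(null);
        if (match != null) {
            return match;
        }
        // Second pass: fall back to any candidate (null if there is none).
        return candidates.get().findFirst().orElse(null);
    }

    // Variant 2: a Collection can simply be iterated again for the second pass.
    static String findWithCollection(Collection<String> candidates, String preferred) {
        for (String c : candidates) {
            if (c.equals(preferred)) {
                return c;
            }
        }
        for (String c : candidates) {
            return c; // first element, if any
        }
        return null;
    }

    // Shows why a plain Stream parameter cannot support two passes.
    static boolean reusingStreamFails() {
        Stream<String> stream = Stream.of("slot-a", "slot-b");
        stream.count(); // first terminal operation consumes the stream
        try {
            stream.count(); // second terminal operation on the same stream
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> slots = Arrays.asList("slot-a", "slot-b");
        System.out.println(findWithSupplier(slots::stream, "slot-b"));
        System.out.println(findWithCollection(slots, "slot-b"));
        System.out.println(reusingStreamFails());
    }
}
```

Both variants return the same result; the `Collection` version states the "can be iterated repeatedly" contract in the parameter type itself, whereas the `Supplier` version relies on the caller returning a new stream on every `get()`.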

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Local recovery scheduling can cause spread out of tasks
> -------------------------------------------------------
>
>                 Key: FLINK-9635
>                 URL: https://issues.apache.org/jira/browse/FLINK-9635
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0, 1.6.2
>            Reporter: Till Rohrmann
>            Assignee: Stefan Richter
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.7.0
>
>
> In order to make local recovery work, Flink's scheduling was changed such that a task
tries to be rescheduled to its previous location. In order to not occupy slots which have state
of other tasks cached, the strategy will request a new slot if the old slot identified by
the previous allocation id is no longer present. This also applies to newly allocated slots
because there is no distinction between new and already used slots. As a result, every task
can get deployed to its own slot if, for example, the {{SlotPool}} has released all slots in
the meantime. The consequence could be that a job can no longer be executed after
a failure because it needs more slots than before.
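
The behaviour described above can be sketched with a deliberately simplified model. This is not Flink's actual scheduler code: `SpreadOutSketch`, `newSlotRequests`, and the string slot ids are hypothetical, and the only rule modelled is "request a new slot whenever the slot with the previous allocation id is not present".

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SpreadOutSketch {

    /**
     * Counts how many new slots are requested when each task only accepts the
     * slot named by its previous allocation id (simplified, assumed model).
     */
    static int newSlotRequests(List<String> previousAllocationIds, Set<String> availableSlotIds) {
        Set<String> pool = new HashSet<>(availableSlotIds);
        int newRequests = 0;
        for (String previousId : previousAllocationIds) {
            if (!pool.contains(previousId)) {
                // Previous slot is gone: a new slot is requested. Its fresh id
                // matches no other task's previous allocation id, so later
                // tasks will not reuse it and will request their own slot.
                newRequests++;
                pool.add("new-" + newRequests);
            }
        }
        return newRequests;
    }

    public static void main(String[] args) {
        // Four tasks that previously shared two slots, "a" and "b".
        List<String> previous = Arrays.asList("a", "a", "b", "b");

        // Old slots still present: no new slot is needed.
        System.out.println(newSlotRequests(previous, new HashSet<>(Arrays.asList("a", "b"))));

        // Pool released all slots: every task requests its own slot.
        System.out.println(newSlotRequests(previous, Collections.<String>emptySet()));
    }
}
```

In this model, four tasks that previously ran in two slots request four new slots once the pool is empty, which is the "needs more slots than before" situation the description warns about.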



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
