flink-issues mailing list archives

From "zhijiang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-6325) Refinement of slot reuse for task manager failure
Date Wed, 19 Apr 2017 09:53:41 GMT

    [ https://issues.apache.org/jira/browse/FLINK-6325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15974401#comment-15974401 ]

zhijiang commented on FLINK-6325:
---------------------------------

[~StephanEwen], for the specific implementation, we would like to add a flag to {{AllocateSlot}}
in {{SlotPool}} to indicate whether a new slot should be requested.
In the {{TaskManager}} failure scenario, the flag will be true, and the {{SlotPool}} will request
a new slot from the {{ResourceManager}} directly.
In the task failure scenario, the flag will be false, and the {{SlotPool}} will first try to match
the previous slots from {{AvailableSlots}} based on {{TaskManagerLocation}}, which is the same as
the current behavior.
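
For illustration only, here is a minimal, self-contained sketch of the flag idea; the type and
method names below are simplified stand-ins, not the actual {{SlotPool}} / {{AllocateSlot}} API.

{code:java}
// Hypothetical sketch of the proposed flag. All names are simplified stand-ins,
// not Flink's actual SlotPool / AllocateSlot API.
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class SlotPoolSketch {

    // available slots keyed by the TaskManager location they live on
    private final Map<String, String> availableSlots = new HashMap<>();

    /**
     * @param preferredLocation location of the slot used by the prior execution
     * @param requestNewSlot    the proposed flag: true on TaskManager failure,
     *                          false on plain task failure
     */
    String allocateSlot(String preferredLocation, boolean requestNewSlot) {
        if (requestNewSlot) {
            // TaskManager failure: its slots are gone, go straight to the ResourceManager.
            return requestSlotFromResourceManager();
        }
        // Task failure: first try to reuse the slot on the previous TaskManagerLocation,
        // which keeps state recovery locality (same as the current behavior).
        Optional<String> previous = Optional.ofNullable(availableSlots.remove(preferredLocation));
        return previous.orElseGet(this::requestSlotFromResourceManager);
    }

    private String requestSlotFromResourceManager() {
        // placeholder for the actual RPC to the ResourceManager
        return "new-slot-from-ResourceManager";
    }
}
{code}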

Do you think this approach makes sense, or do you have other suggestions?

> Refinement of slot reuse for task manager failure
> -------------------------------------------------
>
>                 Key: FLINK-6325
>                 URL: https://issues.apache.org/jira/browse/FLINK-6325
>             Project: Flink
>          Issue Type: Improvement
>          Components: JobManager
>            Reporter: zhijiang
>            Assignee: zhijiang
>            Priority: Minor
>
> After a task or TaskManager failure, the new execution attempt tries by default to take the
slot used by the prior execution. This benefits state recovery locality with the RocksDB backend,
and it makes sense for the task failure scenario.
> But in the TaskManager failure scenario, the slots on that TaskManager are recycled and can
no longer be reused. When an execution from it is reset and allocates a slot from the {{SlotPool}},
no slot can be matched by {{ResourceID}}, so it falls back to matching any other available slot by
{{ResourceProfile}}. As a result, this execution from the failed {{TaskManager}} occupies another
parallel execution's slot, and the following executions may no longer be able to reuse their
previous slots, which hurts state recovery.
> To solve this problem, we would like to request a new slot for re-deployment when the preferred
location is no longer available, so it will not occupy the other live slots any more.
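
A minimal illustration of the matching order described above (by {{ResourceID}} first, then by
{{ResourceProfile}}); the types are simplified stand-ins, not Flink's actual {{AvailableSlots}}
implementation.

{code:java}
// Illustration only: ResourceID match first, then ResourceProfile fallback.
// Simplified stand-in types, not Flink's AvailableSlots.
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

class AvailableSlotsSketch {

    static class Slot {
        final String resourceId;   // identifies the TaskManager the slot lives on
        final int profile;         // simplified stand-in for the ResourceProfile
        Slot(String resourceId, int profile) {
            this.resourceId = resourceId;
            this.profile = profile;
        }
    }

    private final List<Slot> slots = new ArrayList<>();

    Optional<Slot> match(String preferredResourceId, int requestedProfile) {
        // 1) Prefer a slot on the same TaskManager as the prior execution.
        Optional<Slot> byId = slots.stream()
                .filter(s -> s.resourceId.equals(preferredResourceId))
                .findFirst();
        if (byId.isPresent()) {
            return byId;
        }
        // 2) Fall back to any slot with a sufficient profile. After a TaskManager
        //    failure step 1 can never succeed, so this step takes a slot that another
        //    recovering execution would otherwise have reused.
        return slots.stream()
                .filter(s -> s.profile >= requestedProfile)
                .findFirst();
    }
}
{code}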



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
