flink-dev mailing list archives

From "zhijiang (JIRA)" <j...@apache.org>
Subject [jira] [Created] (FLINK-6325) Refinement of slot reuse for task manager failure
Date Wed, 19 Apr 2017 08:27:41 GMT
zhijiang created FLINK-6325:

             Summary: Refinement of slot reuse for task manager failure
                 Key: FLINK-6325
                 URL: https://issues.apache.org/jira/browse/FLINK-6325
             Project: Flink
          Issue Type: Improvement
          Components: JobManager
            Reporter: zhijiang
            Priority: Minor

After a task or TaskManager failure, the new execution attempt by default tries to take the
slot used by the prior execution. This improves state recovery locality with the RocksDB
backend, and it makes sense for the task failure scenario.
In the TaskManager failure scenario, however, the slots of the failed TaskManager are released
and can no longer be reused. When the restarted execution requests a slot from the {{SlotPool}},
no slot matches by {{ResourceID}}, so the pool falls back to matching any other available slot
by {{ResourceProfile}}.
As a result, the execution from the failed {{TaskManager}} occupies a slot that belonged to
another parallel execution, and the executions scheduled after it may in turn be unable to
reuse their previous slots. This hurts state recovery locality.
To solve this problem, we would like to request a new slot for re-deployment when the
execution's prior location is unavailable, so that it no longer occupies slots on TaskManagers
that are still alive.
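
The cascading effect described above can be illustrated with a small toy model. This is only a
sketch under simplifying assumptions, not Flink's actual scheduler code: the `Slot` class, the
`allocate`, `request_new_slot` and `restart` functions, and the `tm-1`/`new-tm-1` identifiers are
all hypothetical, a string stands in for a ResourceProfile, and all three tasks are assumed to be
rescheduled one after another once `tm-1` has failed.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Slot:
    slot_id: str
    resource_id: str  # ID of the TaskManager hosting the slot
    profile: str      # simplified stand-in for a ResourceProfile


def request_new_slot(profile: str, counter: List[int]) -> Slot:
    """Hypothetical helper: obtain a fresh slot on a newly started TaskManager."""
    counter[0] += 1
    return Slot(f"new-slot-{counter[0]}", f"new-tm-{counter[0]}", profile)


def allocate(pool: List[Slot], prior: Slot, alive: Set[str],
             refined: bool, counter: List[int]) -> Slot:
    # 1. Prefer a free slot on the TaskManager used by the prior execution.
    for slot in pool:
        if slot.resource_id == prior.resource_id:
            pool.remove(slot)
            return slot
    # 2. Proposed refinement: if the prior location is gone, request a new
    #    slot instead of taking one that another execution wants to reuse.
    if refined and prior.resource_id not in alive:
        return request_new_slot(prior.profile, counter)
    # 3. Current fallback: take any free slot with a matching profile.
    for slot in pool:
        if slot.profile == prior.profile:
            pool.remove(slot)
            return slot
    return request_new_slot(prior.profile, counter)


def restart(refined: bool) -> Dict[str, str]:
    """Reschedule tasks A, B, C after tm-1 failed; return task -> TaskManager."""
    prior = {
        "A": Slot("s1", "tm-1", "default"),
        "B": Slot("s2", "tm-2", "default"),
        "C": Slot("s3", "tm-3", "default"),
    }
    alive = {"tm-2", "tm-3"}
    # Only slots on surviving TaskManagers return to the pool.
    pool = [s for s in prior.values() if s.resource_id in alive]
    counter = [0]
    return {
        task: allocate(pool, prior[task], alive, refined, counter).resource_id
        for task in prior
    }


if __name__ == "__main__":
    print("current:", restart(refined=False))
    print("refined:", restart(refined=True))
```

In this model the current fallback moves every task to a different TaskManager (A takes B's
slot, B takes C's slot, C needs a new one), while the refined variant moves only task A and
lets B and C keep their previous locations.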

This message was sent by Atlassian JIRA
