flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-9047) SlotPool can fail to release slots
Date Wed, 21 Mar 2018 17:49:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-9047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16408322#comment-16408322 ]

ASF GitHub Bot commented on FLINK-9047:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/5739

    [FLINK-9047] Fix slot recycling in case of failed release

    ## What is the purpose of the change
    
    If a slot cannot be released, it will only be recycled/reused if the owning
    TaskExecutor is still registered at the SlotPool. If this is not the case, then we
    drop the slot from the SlotPool.
    
    cc @GJL 
    
    ## Brief change log
    
    - Only recycle slots which could not be released if owner is still registered
    
    ## Verifying this change
    
    - Added `SlotPoolTest#testReleasingIdleSlotFailed`.
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): (no)
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
      - The serializers: (no)
      - The runtime per-record code paths (performance sensitive): (no)
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
      - The S3 file system connector: (no)
    
    ## Documentation
    
      - Does this pull request introduce a new feature? (no)
      - If yes, how is the feature documented? (not applicable)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink moreLogging

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5739.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5739
    
----
commit fbaa3c1c94af6d7ba26ec6e554e35d9d9b400054
Author: Till Rohrmann <trohrmann@...>
Date:   2018-03-21T14:09:46Z

    [hotfix] Improve Flip-6 component logging

commit 46cda67300fd153828769af8b4139b64b60e34d4
Author: Till Rohrmann <trohrmann@...>
Date:   2018-03-21T15:48:52Z

    [FLINK-9047] Fix slot recycling in case of failed release
    
    If a slot cannot be released, it will only be recycled/reused if the owning
    TaskExecutor is still registered at the SlotPool. If this is not the case, then we
    drop the slot from the SlotPool.

commit c9f3e037021dc8c37eceaaac181eb19c141f0bb7
Author: Till Rohrmann <trohrmann@...>
Date:   2018-03-21T17:44:25Z

    [hotfix] Remove unused method from SlotPool

----


> SlotPool can fail to release slots
> ----------------------------------
>
>                 Key: FLINK-9047
>                 URL: https://issues.apache.org/jira/browse/FLINK-9047
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0
>            Reporter: Till Rohrmann
>            Priority: Blocker
>              Labels: flip-6
>             Fix For: 1.5.0
>
>
> The {{SlotPool}} releases idle slots. If the release operation fails (e.g. due to a timeout),
> the {{SlotPool}} simply continues using the slot. This is problematic if the owning
> {{TaskExecutor}} failed earlier and was unregistered from the {{SlotPool}} in the meantime.
> The {{SlotPool}} will then reuse the slot, and every later attempt to release it once it is
> idle again will fail too. This effectively makes it impossible to schedule the job.
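
The failure mode described above, and the guard that the pull request describes, can be illustrated with a minimal sketch. This is not Flink's actual SlotPool code: the class `SlotPoolSketch`, its `Slot` type, and every method name below are hypothetical and exist only to show the idea that a slot whose release failed is recycled only while its owning TaskExecutor is still registered, and is dropped otherwise.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model of the behaviour; NOT Flink's real SlotPool.
class SlotPoolSketch {

    // Hypothetical slot record: a slot id plus the id of the owning TaskExecutor.
    static final class Slot {
        final String slotId;
        final String ownerId;

        Slot(String slotId, String ownerId) {
            this.slotId = slotId;
            this.ownerId = ownerId;
        }
    }

    private final Set<String> registeredTaskExecutors = new HashSet<>();
    private final Deque<Slot> availableSlots = new ArrayDeque<>();

    void registerTaskExecutor(String taskExecutorId) {
        registeredTaskExecutors.add(taskExecutorId);
    }

    void unregisterTaskExecutor(String taskExecutorId) {
        registeredTaskExecutors.remove(taskExecutorId);
    }

    // Called when releasing an idle slot failed (for example, on a timeout).
    // Returns true if the slot was put back for reuse, false if it was dropped.
    boolean onReleaseFailed(Slot slot) {
        if (registeredTaskExecutors.contains(slot.ownerId)) {
            // Owner is still known to the pool: keep using the slot.
            availableSlots.add(slot);
            return true;
        }
        // Owner is gone: reusing the slot would only fail again, so drop it.
        return false;
    }

    int availableSlotCount() {
        return availableSlots.size();
    }
}
```

In this sketch, a slot owned by an unregistered TaskExecutor never re-enters `availableSlots`, so it cannot be handed out again and trigger the repeated release failures described in the issue.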



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
