flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-6434) There may be allocatedSlots leak in SlotPool
Date Thu, 02 Nov 2017 04:28:02 GMT

    [ https://issues.apache.org/jira/browse/FLINK-6434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235157#comment-16235157 ]

ASF GitHub Bot commented on FLINK-6434:
---------------------------------------

GitHub user shuai-xu opened a pull request:

    https://github.com/apache/flink/pull/4937

    [FLINK-6434] [runtime] cancel slot allocation if request timed out in ProviderAndOwner

    
    ## What is the purpose of the change
    
    This PR adds a protocol for cancelling slot allocations between ProviderAndOwner and SlotPool, so that ProviderAndOwner can cancel slot allocations that are no longer needed and avoid leaking slots.
    
    ## Brief change log
    
      - *Let the ProviderAndOwner generate the allocation id before calling allocateSlot on the slot pool.*
      - *If the allocateSlot call times out, the ProviderAndOwner cancels the previous allocation in the slot pool.*
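The two steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this pull request: `ToySlotPool`, `cancelSlotAllocation`, `pendingCount`, and `requestSlot` are hypothetical names, allocation ids and slots are plain strings, and the caller blocks on the future for simplicity, whereas the real components communicate through asynchronous RPC.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CancelOnTimeoutSketch {

    /** Hypothetical stand-in for the slot pool; not Flink's SlotPool API. */
    static class ToySlotPool {
        private final Map<String, CompletableFuture<String>> pendingRequests = new HashMap<>();

        /** Registers a pending request under the id chosen by the caller. */
        synchronized CompletableFuture<String> allocateSlot(String allocationId) {
            CompletableFuture<String> future = new CompletableFuture<>();
            pendingRequests.put(allocationId, future);
            return future; // would be completed once a slot is offered
        }

        /** Drops the pending request so a later slot offer cannot fulfill it. */
        synchronized void cancelSlotAllocation(String allocationId) {
            CompletableFuture<String> future = pendingRequests.remove(allocationId);
            if (future != null) {
                future.cancel(false);
            }
        }

        synchronized int pendingCount() {
            return pendingRequests.size();
        }
    }

    /** Caller side: generate the id first, then cancel by that id on timeout. */
    static String requestSlot(ToySlotPool pool, long timeoutMillis) throws Exception {
        // Step 1: the caller owns the allocation id, so it can refer to the
        // request even if it never receives an answer.
        String allocationId = UUID.randomUUID().toString();
        CompletableFuture<String> future = pool.allocateSlot(allocationId);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Step 2: tell the pool the request is no longer needed.
            pool.cancelSlotAllocation(allocationId);
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        ToySlotPool pool = new ToySlotPool();
        try {
            requestSlot(pool, 50); // no slot is ever offered, so this times out
        } catch (TimeoutException e) {
            System.out.println("timed out, pending requests left: " + pool.pendingCount());
        }
    }
}
```

Generating the id on the caller side is what makes cancellation possible here: if the pool assigned the id, a caller whose request timed out would have no handle with which to name the request it wants to withdraw.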
    
    ## Verifying this change
    
    This change added tests and can be verified as follows:
    
      - *Added a unit test in SlotPoolRpcTest*
      - *Modified the existing SlotPoolTest*
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): (no)
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
      - The serializers: (no)
      - The runtime per-record code paths (performance sensitive): (no)
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing,
Yarn/Mesos, ZooKeeper: (no)
    
    ## Documentation
    
      - Does this pull request introduce a new feature? (no)
      - If yes, how is the feature documented? (not applicable)
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/shuai-xu/flink jira-6434

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/4937.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4937
    
----
commit ab3c599d55847451a1194ba55375207267561a71
Author: shuai.xus <shuai.xus@alibaba-inc.com>
Date:   2017-10-20T09:12:39Z

    [FLINK-6434] cancel slot allocation if request timed out in ProviderAndOwner
    
    Summary:
    This fixes Flink JIRA issue FLINK-6434.
    1. Let the ProviderAndOwner generate the allocation id before calling allocateSlot on the slot pool.
    2. If the allocateSlot call times out, the ProviderAndOwner cancels the previous allocation in the slot pool.
    
    Test Plan: UnitTest
    
    Reviewers: haitao.w
    
    Differential Revision: https://aone.alibaba-inc.com/code/D323990

----


> There may be allocatedSlots leak in SlotPool
> --------------------------------------------
>
>                 Key: FLINK-6434
>                 URL: https://issues.apache.org/jira/browse/FLINK-6434
>             Project: Flink
>          Issue Type: Bug
>          Components: Cluster Management
>            Reporter: shuai.xu
>            Assignee: shuai.xu
>            Priority: Major
>              Labels: flip-6
>
> If the allocateSlot() call from Execution to SlotPool times out, the job begins to fail over, but the pending request remains in the SlotPool. If a new slot then registers with the SlotPool, it may fulfill the outdated pending request and be added to allocatedSlots, where it will never be used and never be recycled.
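
The leak described above can be reproduced in a toy model of the pool's bookkeeping. This is only a sketch under simplifying assumptions: `ToyPool`, `offerSlot`, `cancelRequest`, and `leakedSlots` are hypothetical names, requests and slots are plain strings, and the timeout itself is represented by a comment rather than a real timer.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SlotLeakSketch {

    /** Hypothetical model of the pool's bookkeeping; not Flink's SlotPool. */
    static class ToyPool {
        final Deque<String> pendingRequests = new ArrayDeque<>();
        final Set<String> allocatedSlots = new HashSet<>();
        final List<String> availableSlots = new ArrayList<>();

        void requestSlot(String requestId) {
            pendingRequests.add(requestId);
        }

        void cancelRequest(String requestId) {
            pendingRequests.remove(requestId);
        }

        /** A newly registered slot serves the oldest pending request, if any. */
        void offerSlot(String slotId) {
            String request = pendingRequests.poll();
            if (request != null) {
                allocatedSlots.add(slotId); // marked as in use by that request
            } else {
                availableSlots.add(slotId); // free for future requests
            }
        }
    }

    /** Counts slots that end up allocated although no caller is waiting for them. */
    static int leakedSlots(boolean cancelOnTimeout) {
        ToyPool pool = new ToyPool();
        pool.requestSlot("request-1");
        // The caller times out here and the job starts to fail over.
        if (cancelOnTimeout) {
            pool.cancelRequest("request-1");
        }
        // A new slot registers after the caller has already given up.
        pool.offerSlot("slot-1");
        return pool.allocatedSlots.size();
    }

    public static void main(String[] args) {
        System.out.println("leaked without cancel: " + leakedSlots(false)); // prints 1
        System.out.println("leaked with cancel: " + leakedSlots(true));     // prints 0
    }
}
```

Without cancellation the stale request absorbs the new slot, which then sits in the allocated set with no owner to release it. With cancellation the slot lands in the available list instead.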



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
