flink-issues mailing list archives

From "Shimin Yang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (FLINK-9567) Flink does not release resource in Yarn Cluster mode
Date Fri, 15 Jun 2018 02:27:00 GMT

     [ https://issues.apache.org/jira/browse/FLINK-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shimin Yang updated FLINK-9567:
-------------------------------
    Attachment: FlinkYarnProblem

> Flink does not release resource in Yarn Cluster mode
> ----------------------------------------------------
>
>                 Key: FLINK-9567
>                 URL: https://issues.apache.org/jira/browse/FLINK-9567
>             Project: Flink
>          Issue Type: Bug
>          Components: Cluster Management, YARN
>    Affects Versions: 1.5.0
>            Reporter: Shimin Yang
>            Priority: Major
>         Attachments: FlinkYarnProblem
>
>
> After the Job Manager restarts in Yarn Cluster mode, Flink does not release Task Manager
containers in some specific cases.
> In the first log I posted, the container with id 24 is the reason why Yarn did not release
resources, even though the Task Manager in that container had been released before the restart.
> But in line 347, 
> 2018-06-14 22:50:47,846 WARN akka.remote.ReliableDeliverySupervisor - Association with
remote system [akka.tcp://flink@bd-r1hdp69:30609] has failed, address is now gated for [50]
ms. Reason: [Disassociated] 
> this problem caused Flink to request one more container than it needed. Since the return of
excess containers is determined by the *numPendingContainerRequests* variable in *YarnResourceManager*,
I think *onContainersCompleted* in *YarnResourceManager* called the method *requestYarnContainer*, which
increased *numPendingContainerRequests*. However, because the restart logic had already
allocated enough containers for the Task Managers, Flink holds the extra container for
a long time for nothing. In the worst case, I had a job configured with 5 Task Managers that
held more than 100 containers in the end.
> PS: Another strange thing I found is that a request for a Yarn container sometimes
returns many more containers than were requested. Is this normal behavior for AMRMClientAsync?
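
The suspected over-requesting described in the report can be illustrated with a toy model. This is only a sketch under stated assumptions, not Flink's actual *YarnResourceManager* code: the class `PendingRequestSketch`, the method `simulate`, and the `checkBeforeRequest` guard are all hypothetical names introduced for illustration. It assumes that each stale "container completed" callback triggers one new container request, and that Yarn eventually grants every pending request.

```java
/** Toy model of the bookkeeping described in the report; NOT Flink code. */
public class PendingRequestSketch {

    /**
     * Returns the number of containers held after processing the callbacks.
     *
     * @param required           Task Managers the job is configured with
     * @param completedCallbacks stale "container completed" callbacks received
     * @param checkBeforeRequest hypothetical guard: only request when short of containers
     */
    static int simulate(int required, int completedCallbacks, boolean checkBeforeRequest) {
        // Assumption: the restart logic has already allocated enough containers.
        int running = required;
        // Stand-in for the pending-request counter named in the report.
        int pending = 0;

        for (int i = 0; i < completedCallbacks; i++) {
            // Unguarded: every callback files a new request.
            // Guarded: request only if running + pending is below the requirement.
            if (!checkBeforeRequest || running + pending < required) {
                pending++;
            }
        }

        // Assumption: Yarn grants every pending request and none are returned.
        return running + pending;
    }

    public static void main(String[] args) {
        System.out.println(simulate(5, 3, false)); // prints 8: three excess containers
        System.out.println(simulate(5, 3, true));  // prints 5: no excess containers
    }
}
```

In this model the unguarded variant grows by one container per stale callback, which matches the shape of the reported symptom (5 configured Task Managers, far more containers held). Whether a guard of this kind is the appropriate fix in Flink is not established by the report.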



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
