flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-9190) YarnResourceManager sometimes does not request new Containers
Date Sat, 05 May 2018 20:13:00 GMT

    https://issues.apache.org/jira/browse/FLINK-9190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464885#comment-16464885

ASF GitHub Bot commented on FLINK-9190:

Github user StephanEwen commented on the issue:

    @sihuazhou and @shuai-xu thank you for your help in understanding the bug here.
    Let me rephrase it to make sure I understand the problem exactly. The steps are the following:
      1. JobMaster / SlotPool requests a slot (AllocationID) from the ResourceManager
      2. ResourceManager starts a container with a TaskManager
      3. TaskManager registers with the ResourceManager, which tells the TaskManager to offer a slot
to the JobManager.
      4. TaskManager container is killed
      5. The ResourceManager does not queue the slot requests (AllocationIDs) that it sent to
the previous TaskManager back as pending, so the requests are lost and must time out before
another attempt is made.
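    The failure sequence above can be sketched as a minimal state model (all class and method
names here are hypothetical illustrations, not Flink's actual API):

    ```java
    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Queue;

    // Sketch of the suspected bug: slot requests move from "pending" to
    // "assigned" when a container starts, but on container failure the
    // assigned requests are dropped instead of being re-queued, so they
    // are lost until the SlotPool's request times out.
    public class LostSlotRequestsSketch {

        private final Queue<String> pendingAllocationIds = new ArrayDeque<>();
        private final Map<String, List<String>> assignedByContainer = new HashMap<>();

        void requestSlot(String allocationId) {      // step 1: SlotPool requests a slot
            pendingAllocationIds.add(allocationId);
        }

        void containerStarted(String containerId) {  // steps 2-3: container takes the pending requests
            List<String> assigned = new ArrayList<>(pendingAllocationIds);
            pendingAllocationIds.clear();
            assignedByContainer.put(containerId, assigned);
        }

        void containerFailed(String containerId) {   // steps 4-5: the bug - requests are dropped;
            assignedByContainer.remove(containerId); // the fix would re-add them to pendingAllocationIds
        }

        int pendingCount() {
            return pendingAllocationIds.size();
        }

        public static void main(String[] args) {
            LostSlotRequestsSketch rm = new LostSlotRequestsSketch();
            rm.requestSlot("alloc-1");
            rm.containerStarted("container-1");
            rm.containerFailed("container-1");
            // The request for alloc-1 is gone: nothing is pending, so no new container is requested.
            System.out.println("pending after failure: " + rm.pendingCount());
        }
    }
    ```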
    Some thoughts on how to deal with this:
      - It seems the ResourceManager should put the slots from the failed TaskManager back
to "pending" so they are given to the next TaskManager that starts.
      - I assume that is not happening because of the concern that the failure is also
detected by the JobManager/SlotPool and retried there, leading to double retries.
      - The solution would be to define the protocol more clearly with respect to who is responsible
for which retries.
    Two ideas on how to fix that:
      1. The ResourceManager notifies the SlotPool that a certain set of AllocationIDs has
failed, and the SlotPool directly retries the allocations, which immediately starts new
containers.
      2. The ResourceManager always retries allocations for AllocationIDs it knows about. The SlotPool
would not retry; it would always keep the same allocations unless they are released as unneeded.
We would probably need something to let the SlotPool distinguish between different
offers of the same AllocationID (in case the ResourceManager assumes a timeout but a request
actually goes through) - possibly something like an attempt counter (higher wins).
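    The attempt-counter idea in option 2 could look roughly like this (hypothetical names,
not Flink's actual API): the SlotPool remembers the highest attempt number seen per
AllocationID and rejects offers from older attempts, so a late offer from a timed-out
earlier attempt cannot override a newer retry by the ResourceManager.

    ```java
    import java.util.HashMap;
    import java.util.Map;

    // Sketch of "attempt counter, higher wins" slot-offer deduplication.
    public class AttemptCounterSketch {

        private final Map<String, Integer> highestAttempt = new HashMap<>();

        // Returns true if the offer is accepted.
        boolean offerSlot(String allocationId, int attempt) {
            int seen = highestAttempt.getOrDefault(allocationId, -1);
            if (attempt < seen) {
                return false;                  // stale offer from an earlier attempt: reject
            }
            highestAttempt.put(allocationId, attempt);
            return true;                       // newer (or repeated) attempt: accept
        }

        public static void main(String[] args) {
            AttemptCounterSketch pool = new AttemptCounterSketch();
            System.out.println(pool.offerSlot("alloc-1", 2));  // retry arrives first: accepted
            System.out.println(pool.offerSlot("alloc-1", 1));  // late original offer: rejected
        }
    }
    ```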
    @tillrohrmann also interested in your thoughts here.

> YarnResourceManager sometimes does not request new Containers
> -------------------------------------------------------------
>                 Key: FLINK-9190
>                 URL: https://issues.apache.org/jira/browse/FLINK-9190
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination, YARN
>    Affects Versions: 1.5.0
>         Environment: Hadoop 2.8.3
> ZooKeeper 3.4.5
> Flink 71c3cd2781d36e0a03d022a38cc4503d343f7ff8
>            Reporter: Gary Yao
>            Assignee: Gary Yao
>            Priority: Blocker
>              Labels: flip-6
>             Fix For: 1.5.0
>         Attachments: yarn-logs
> *Description*
> The {{YarnResourceManager}} does not request new containers if {{TaskManagers}} are killed
rapidly in succession. After 5 minutes the job is restarted due to {{NoResourceAvailableException}},
and the job runs normally afterwards. I suspect that {{TaskManager}} failures are not registered
if the failure occurs before the {{TaskManager}} registers with the master. Logs are attached;
I added additional log statements to {{YarnResourceManager.onContainersCompleted}} and {{YarnResourceManager.onContainersAllocated}}.
> *Expected Behavior*
> The {{YarnResourceManager}} should recognize that the container is completed and keep
requesting new containers. The job should run as soon as resources are available. 

This message was sent by Atlassian JIRA
