flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-9190) YarnResourceManager sometimes does not request new Containers
Date Mon, 23 Apr 2018 02:49:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-9190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16447492#comment-16447492

ASF GitHub Bot commented on FLINK-9190:

Github user sihuazhou commented on a diff in the pull request:

    --- Diff: flink-yarn/src/test/java/org/apache/flink/yarn/YarnResourceManagerTest.java
    @@ -388,4 +390,108 @@ public void testStopWorker() throws Exception {
     			assertTrue(resourceManager.getNumberOfRegisteredTaskManagers().get() == 0);
    +	/**
    +	 * Tests the case that containers are killed before registering with ResourceManager
    +	 */
    +	@Test
    +	public void testKillContainerBeforeTMRegisterSuccessfully() throws Exception {
    --- End diff --
    Hmm... most of the code in this test mirrors another test, `testStopWorker()`, in this
class. I agree that it's a bit complicated, but logically it ensures that we can test
the corner case properly (the container is killed before registering successfully). TBH,
I don't know how else to make sure this corner case can be tested; I think I'm being a bit
foolish here... could you give me some more detailed advice?

> YarnResourceManager sometimes does not request new Containers
> -------------------------------------------------------------
>                 Key: FLINK-9190
>                 URL: https://issues.apache.org/jira/browse/FLINK-9190
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination, YARN
>    Affects Versions: 1.5.0
>         Environment: Hadoop 2.8.3
> ZooKeeper 3.4.5
> Flink 71c3cd2781d36e0a03d022a38cc4503d343f7ff8
>            Reporter: Gary Yao
>            Assignee: Sihua Zhou
>            Priority: Blocker
>              Labels: flip-6
>             Fix For: 1.5.0
>         Attachments: yarn-logs
> *Description*
> The {{YarnResourceManager}} does not request new containers if {{TaskManagers}} are killed
rapidly in succession. After 5 minutes the job is restarted due to {{NoResourceAvailableException}},
and the job runs normally afterwards. I suspect that {{TaskManager}} failures are not registered
if the failure occurs before the {{TaskManager}} registers with the master. Logs are attached;
I added additional log statements to {{YarnResourceManager.onContainersCompleted}} and {{YarnResourceManager.onContainersAllocated}}.
> *Expected Behavior*
> The {{YarnResourceManager}} should recognize that the container is completed and keep
requesting new containers. The job should run as soon as resources are available. 
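
The expected behavior above can be illustrated with a minimal sketch. This is a toy bookkeeping model, not Flink code: the class `ContainerTracker` and all of its methods are hypothetical names introduced only for illustration, and the real `YarnResourceManager` may track containers quite differently. The point it shows is that a container which completes before its TaskManager registers should still trigger a replacement request.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Toy model (hypothetical, not part of Flink): tracks containers that were
 * allocated, including those whose TaskManager has not registered yet.
 */
class ContainerTracker {
    // Containers whose TaskManager has registered with the master.
    private final Set<String> registered = new HashSet<>();
    // Containers that were allocated but have not registered yet.
    private final Set<String> allocatedNotRegistered = new HashSet<>();
    // Number of outstanding container requests.
    private int pendingRequests = 0;

    void requestContainer() {
        pendingRequests++;
    }

    void onContainerAllocated(String containerId) {
        if (pendingRequests > 0) {
            pendingRequests--;
        }
        allocatedNotRegistered.add(containerId);
    }

    void onTaskManagerRegistered(String containerId) {
        if (allocatedNotRegistered.remove(containerId)) {
            registered.add(containerId);
        }
    }

    void onContainerCompleted(String containerId) {
        // Non-short-circuit '|' so that both sets are always cleaned up.
        boolean wasKnown = registered.remove(containerId)
                | allocatedNotRegistered.remove(containerId);
        if (wasKnown) {
            // Request a replacement even if the TaskManager never registered.
            requestContainer();
        }
    }

    int pendingRequests() {
        return pendingRequests;
    }

    public static void main(String[] args) {
        ContainerTracker tracker = new ContainerTracker();
        tracker.requestContainer();
        tracker.onContainerAllocated("container_1");
        // The container is killed before its TaskManager registers.
        tracker.onContainerCompleted("container_1");
        System.out.println("pending requests: " + tracker.pendingRequests());
    }
}
```

In this sketch, a tracker that consulted only the `registered` set in `onContainerCompleted` would ignore the early failure and never request a replacement, which would match the symptom suspected in the description.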

