hadoop-yarn-issues mailing list archives

From "Wangda Tan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4502) Sometimes Two AM containers get launched
Date Wed, 23 Dec 2015 22:56:46 GMT

    [ https://issues.apache.org/jira/browse/YARN-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070289#comment-15070289 ]

Wangda Tan commented on YARN-4502:
----------------------------------

Thanks to [~yeshavora] for reporting this issue.

I looked at this issue with [~jianhe] and [~vinodkv]. The root cause of the problem is:
- After YARN-3535, the resource request of every container that transitions from ALLOCATED to
KILLED is re-added to the scheduler, and it is added to the *current* scheduler application attempt.
- If some containers are in the ALLOCATED state when the AM crashes, the resource requests of
these containers can therefore be added to the *new* scheduler application attempt.
- When the new application attempt requests its AM container, it calls
{code}
        // AM resource has been checked when submission
        Allocation amContainerAllocation =
            appAttempt.scheduler.allocate(appAttempt.applicationAttemptId,
                Collections.singletonList(appAttempt.amReq),
                EMPTY_CONTAINER_RELEASE_LIST, null, null);
        if (amContainerAllocation != null
            && amContainerAllocation.getContainers() != null) {
          assert (amContainerAllocation.getContainers().size() == 0);
        }
{code}
Some containers can be returned by this scheduler.allocate call. These containers are
ignored, because the preceding *assert* is not enabled in production environments.
- As a result, containers can be leaked while the AM container of the retried attempt is
being allocated.
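
The effect of a disabled assert can be reproduced outside YARN. The following is a minimal sketch, not RM code: the class {{AmAllocationAssertSketch}}, its methods, and the use of plain strings for containers are all hypothetical stand-ins. It only shows that a Java {{assert}} is skipped unless the JVM is started with {{-ea}}, so the check above cannot stop unexpected containers from being dropped.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in, not part of YARN: models a check guarded only by an assert.
final class AmAllocationAssertSketch {

    /** Reports whether this JVM was started with assertions enabled (-ea). */
    static boolean assertionsEnabled() {
        boolean enabled = false;
        assert enabled = true; // the assignment only runs when assertions are on
        return enabled;
    }

    /**
     * Same shape as the check in the quoted snippet: the only guard against
     * unexpected containers is an assert. Returns how many containers were
     * silently ignored.
     */
    static int ignoredContainers(List<String> allocated) {
        assert allocated.isEmpty() : "AM allocate call returned containers";
        // With assertions disabled execution reaches this line,
        // and nothing has released the containers.
        return allocated.size();
    }

    public static void main(String[] args) {
        // Assumed container id, for illustration only.
        List<String> stale = Arrays.asList("container_02_000001");
        if (assertionsEnabled()) {
            System.out.println("assertions on: the check throws AssertionError");
        } else {
            System.out.println("assertions off: ignored "
                + ignoredContainers(stale) + " container(s)");
        }
    }
}
```

Running the sketch with and without {{-ea}} shows the two behaviors; production JVMs normally run without it.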

*Possible fixes*:
1) Release all containers returned by {{amContainerAllocation.getContainers()}}
OR
2) Instead of using {{getCurrentAttemptForContainer}} in {{AbstractYarnScheduler#recoverResourceRequestForContainer}},
recover the ResourceRequest only to the attempt that owns the container.
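
The two options can be compared on a toy model. This is a rough sketch under stated assumptions, not a patch: {{ToyScheduler}}, its fields, and {{releaseUnexpected}} are hypothetical names, requests and containers are plain strings, and every pending request is assumed to turn into a container on the next allocate call.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not YARN code.
final class AmRetryFixSketch {

    /** Toy scheduler that tracks pending requests per application attempt. */
    static final class ToyScheduler {
        final Map<Integer, List<String>> pendingByAttempt = new HashMap<>();
        final List<String> released = new ArrayList<>();

        /**
         * Option 2: re-add a killed container's request to the attempt that
         * owned the container, never to whichever attempt is current.
         */
        void recoverRequest(int owningAttempt, String request) {
            pendingByAttempt
                .computeIfAbsent(owningAttempt, k -> new ArrayList<>())
                .add(request);
        }

        /** Returns and clears everything pending for the given attempt. */
        List<String> allocate(int attempt) {
            List<String> pending = pendingByAttempt.remove(attempt);
            return pending == null ? new ArrayList<>() : pending;
        }

        void release(String container) {
            released.add(container);
        }
    }

    /**
     * Option 1: release whatever the AM allocate call returns instead of
     * relying on an assert. Returns the number of containers released.
     */
    static int releaseUnexpected(ToyScheduler scheduler, int attempt) {
        List<String> unexpected = scheduler.allocate(attempt);
        for (String container : unexpected) {
            scheduler.release(container);
        }
        return unexpected.size();
    }
}
```

In this model option 1 cleans up after the stale request has already reached the new attempt, while option 2 keeps it from reaching the new attempt in the first place.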

> Sometimes Two AM containers get launched
> ----------------------------------------
>
>                 Key: YARN-4502
>                 URL: https://issues.apache.org/jira/browse/YARN-4502
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Yesha Vora
>            Assignee: Wangda Tan
>            Priority: Critical
>
> Scenario : 
> * set yarn.resourcemanager.am.max-attempts = 2
> * start dshell application
> {code}
> yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar hadoop-yarn-applications-distributedshell-*.jar -attempt_failures_validity_interval 60000 -shell_command "sleep 150" -num_containers 16
> {code}
> * Kill AM pid
> * Print container list for 2nd attempt
> {code}
> yarn container -list appattempt_1450825622869_0001_000002
> INFO impl.TimelineClientImpl: Timeline service address: http://xxx:port/ws/v1/timeline/
> INFO client.RMProxy: Connecting to ResourceManager at xxx/10.10.10.10:<port>
> Total number of containers :2
> Container-Id    Start Time    Finish Time    State    Host    Node Http Address    LOG-URL
> container_e12_1450825622869_0001_02_000002    Tue Dec 22 23:07:35 +0000 2015    N/A    RUNNING    xxx:25454    http://xxx:8042    http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_000002/hrt_qa
> container_e12_1450825622869_0001_02_000001    Tue Dec 22 23:07:34 +0000 2015    N/A    RUNNING    xxx:25454    http://xxx:8042    http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_000001/hrt_qa
> {code}
> * Look for the new AM pid
> Here, the 2nd AM container was supposed to be started on container_e12_1450825622869_0001_02_000001,
> but the AM was not launched on that container. It was in the ACQUIRED state.
> On the other hand, container_e12_1450825622869_0001_02_000002 got the AM running.
> Expected behavior: the RM should not start 2 containers for starting the AM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
