hadoop-yarn-issues mailing list archives

From "john lilley (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-4449) ResourceManager can return task container with less than requested memory
Date Sat, 12 Dec 2015 22:38:46 GMT

     [ https://issues.apache.org/jira/browse/YARN-4449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

john lilley updated YARN-4449:
------------------------------
    Attachment: app312_rm.log

> ResourceManager can return task container with less than requested memory
> -------------------------------------------------------------------------
>
>                 Key: YARN-4449
>                 URL: https://issues.apache.org/jira/browse/YARN-4449
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.3.0
>         Environment: Cloudera CDH 5.4.5
>            Reporter: john lilley
>            Priority: Minor
>         Attachments: app312_rm.log
>
>
> Occasionally, and apparently only when more than one YARN task is running at once, the ResourceManager may return a container that was reserved for the AM launch, which is smaller than the requested container size for a task.
> We observed this as a failure: a task was killed for exceeding its memory limit. When investigating, we found that the following had happened:
> • Client requests AM launch with 1024MB of memory
> • RM reserves container _000001 with 1024MB of memory
> • RM allocates container _000002 with 1024MB of memory and launches the AM in that one
> • When the AM starts requesting task containers with 2048MB of memory, the reserved _000001 is still there, and the scheduler returns it, because that's what reserved containers are for. However, it does not check that the reserved container has as much memory as is currently being requested (see the sketch below).
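> The missing guard would be something like the following minimal sketch. This is my hypothetical reconstruction, not the actual scheduler code; Resources.fitsIn is a real helper in org.apache.hadoop.yarn.util.resource.Resources, but everything around it is illustrative:
>
>     import org.apache.hadoop.yarn.api.records.Resource;
>     import org.apache.hadoop.yarn.util.resource.Resources;
>
>     /** Hypothetical guard: a reserved container should only satisfy a request
>      *  if it is at least as large as the request's capability. */
>     static boolean reservationSatisfies(Resource requested, Resource reserved) {
>       // fitsIn(smaller, bigger) is true when every dimension of 'smaller'
>       // (memory, vcores) fits within 'bigger'.
>       return Resources.fitsIn(requested, reserved);
>     }
>
> With such a check in place, the scheduler would skip the 1024MB reservation when fulfilling the 2048MB request instead of returning it.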
> This seems to be a timing problem and occurs erratically. Sorry, I could not try this on a newer cluster because it is so hard to reproduce. However, you can see in our AM's log where it asks for 2000MB and gets 1024MB:
> 2015-12-09 02:41:10 INFO net.redpoint.yarnapp.ApplicationMaster: TaskLauncher.run: ** STARTING CONTAINER **
>   task = Task['([...] containerRequest=Capability[<memory:2000, vCores:0>]Priority[0], container=container_1446677679275_0312_01_000001, state=new, result=null, diagnostics='null', retries=0]
>   container = Container: [ContainerId: container_1446677679275_0312_01_000001, NodeId: rpb-cdh-kerb-2.office.datalever.com:8041, NodeHttpAddress: rpb-cdh-kerb-2.office.datalever.com:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.58.41:8041 }, ]
> This is probably clearer in the attached snippet of the RM log, where you can see this happening with appid 312 (ignore 311, which is also in there). You can see that the RM reserves one container, launches the AM in a second, and then later returns the reserved container in response to a task container request of 2000MB, so it comes up short.
> This is relatively easy to work around (just reject that container and wait for another), which is why this is of minor importance; a sketch of the workaround follows below. But it seems that YARN should give you the memory you requested, and it doesn't in this case. Perhaps this is "as designed", but it is certainly unexpected.
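> For reference, here is a minimal sketch of that workaround, assuming an AM built on AMRMClientAsync. The onContainersAllocated callback, releaseAssignedContainer, and addContainerRequest are the real YARN client API; requestedMemoryMb, makeContainerRequest, and launchTask are hypothetical stand-ins for our own code:
>
>     import java.util.List;
>     import org.apache.hadoop.yarn.api.records.Container;
>     import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;
>
>     // Inside the AM's CallbackHandler: reject any container smaller than
>     // requested and keep waiting for a big-enough one.
>     public void onContainersAllocated(List<Container> containers) {
>       for (Container c : containers) {
>         if (c.getResource().getMemory() < requestedMemoryMb) { // our own field
>           // Give the undersized container back to the RM...
>           amRMClient.releaseAssignedContainer(c.getId());
>           // ...and re-post the request so the RM allocates a replacement.
>           amRMClient.addContainerRequest(makeContainerRequest()); // hypothetical helper
>         } else {
>           launchTask(c); // hypothetical: normal launch path
>         }
>       }
>     }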



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
