hadoop-yarn-issues mailing list archives

From "john lilley (JIRA)" <j...@apache.org>
Subject [jira] [Created] (YARN-4449) ResourceManager can return task container with less than requested memory
Date Sat, 12 Dec 2015 22:38:46 GMT
john lilley created YARN-4449:
---------------------------------

             Summary: ResourceManager can return task container with less than requested memory
                 Key: YARN-4449
                 URL: https://issues.apache.org/jira/browse/YARN-4449
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 2.3.0
         Environment: Cloudera CDH 5.4.5
            Reporter: john lilley
            Priority: Minor
         Attachments: app312_rm.log

Occasionally, and apparently only when more than one YARN task is running at once, the ResourceManager
may return a container that was reserved for the AM launch and is smaller than the requested
task container size.

We observed this as a failure: a task was killed for exceeding its memory limit.  On investigation,
we found the following sequence:
•	The client requests an AM launch with 1024MB of memory.
•	The RM reserves container _000001 with 1024MB of memory.
•	The RM allocates container _000002 with 1024MB of memory and launches the AM in it.
•	When the AM starts requesting task containers with 2048MB of memory, the reserved _000001
is still there, and the scheduler returns it, because that is what reserved containers are
for.  However, it does not check that the reserved container has as much memory as is currently
being requested.
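Conceptually, the scheduler appears to hand the reserved container back without re-validating
its capability against the current request.  A minimal sketch of the check that seems to be
missing follows; the Resource class and satisfies() method here are simplified stand-ins for
illustration, not the actual YARN scheduler code:

```java
// Minimal model of the capacity check the scheduler appears to skip.
// Resource is a simplified stand-in, not the real org.apache.hadoop.yarn class.
public class ReservedContainerCheck {
    static final class Resource {
        final int memoryMb;
        final int vCores;
        Resource(int memoryMb, int vCores) { this.memoryMb = memoryMb; this.vCores = vCores; }
    }

    /** A reserved container should only satisfy a request it can actually fit. */
    static boolean satisfies(Resource reserved, Resource requested) {
        return reserved.memoryMb >= requested.memoryMb && reserved.vCores >= requested.vCores;
    }

    public static void main(String[] args) {
        Resource reservedForAm = new Resource(1024, 1);  // container _000001, reserved for the AM
        Resource taskRequest = new Resource(2048, 1);    // the AM's later task request
        System.out.println(satisfies(reservedForAm, taskRequest)); // prints "false"
    }
}
```

With this check in place, the 1024MB reservation would not be returned for a 2048MB request.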

This seems to be a timing problem and occurs erratically.  Unfortunately it is so hard to
reproduce that we could not try it on a newer cluster.  However, you can see in our AM's log
where it asks for 2000MB and gets 1024MB:

2015-12-09 02:41:10 INFO net.redpoint.yarnapp.ApplicationMaster: TaskLauncher.run: ** STARTING CONTAINER **
  task = Task['([...] containerRequest=Capability[<memory:2000, vCores:0>]Priority[0], container=container_1446677679275_0312_01_000001, state=new, result=null, diagnostics='null', retries=0]
  container = Container: [ContainerId: container_1446677679275_0312_01_000001, NodeId: rpb-cdh-kerb-2.office.datalever.com:8041, NodeHttpAddress: rpb-cdh-kerb-2.office.datalever.com:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.58.41:8041 }, ]

This is probably clearer in the attached snippet of the RM log, where you can see it happening
with appid 312 (ignore 311, which is also in there).  The RM reserves one container, launches
the AM in a second, and later returns the reserved container in response to a 2000MB task
container request, so the task comes up short.

This is relatively easy to work around (just reject the undersized container and wait for
another), which is why the priority is minor.  But it seems that YARN should give you the
memory you requested, and in this case it doesn't.  Perhaps this is "as designed", but it is
certainly unexpected.
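The workaround can be sketched as follows.  This is a simplified, self-contained model: the
Alloc class and rejectUndersized() helper are illustrative only; in a real AM you would compare
each allocated Container.getResource() against your ContainerRequest and call
AMRMClient.releaseAssignedContainer() on any container that comes up short, then re-request.

```java
// Self-contained sketch of the AM-side workaround: filter out allocated
// containers that are smaller than what was requested.  A real AM would
// release each rejected container back to the RM and ask again.
import java.util.ArrayList;
import java.util.List;

public class RejectUndersized {
    static final class Alloc {
        final String id;
        final int memoryMb;
        Alloc(String id, int memoryMb) { this.id = id; this.memoryMb = memoryMb; }
    }

    /** Return the allocations too small for the request; the caller releases these. */
    static List<Alloc> rejectUndersized(List<Alloc> allocated, int requestedMb) {
        List<Alloc> rejected = new ArrayList<>();
        for (Alloc a : allocated) {
            if (a.memoryMb < requestedMb) {
                rejected.add(a); // undersized: release and wait for another
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        List<Alloc> got = new ArrayList<>();
        got.add(new Alloc("_000001", 1024)); // the stale AM reservation from this bug
        got.add(new Alloc("_000003", 2048)); // a correctly sized container
        List<Alloc> bad = rejectUndersized(got, 2000);
        System.out.println(bad.size() + " " + bad.get(0).id); // prints "1 _000001"
    }
}
```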



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
