From: "john lilley (JIRA)"
To: yarn-issues@hadoop.apache.org
Date: Sat, 12 Dec 2015 22:38:46 +0000 (UTC)
Subject: [jira] [Created] (YARN-4449) ResourceManager can return task container with less than requested memory

john lilley created YARN-4449:
---------------------------------

             Summary: ResourceManager can return task container with less than requested memory
                 Key: YARN-4449
                 URL: https://issues.apache.org/jira/browse/YARN-4449
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 2.3.0
         Environment: Cloudera CDH 5.4.5
            Reporter: john lilley
            Priority: Minor
         Attachments: app312_rm.log

Occasionally, and apparently only when more than one YARN task is running at once, the ResourceManager may return a container that was reserved for the AM launch and is smaller than the requested container size for a task. We observed this as a failure: a task was killed for exceeding its memory limit. When investigating, we found the following had happened:

• Client requests AM launch with 1024MB memory
• RM reserves container _000001 with 1024MB memory
• RM allocates container _000002 with 1024MB memory and launches the AM in it
• When the AM starts requesting task containers with 2048MB memory, the reserved _000001 is still there, and the scheduler returns it, because that's what reserved containers are for. However, it doesn't check that the reserved container has as much memory as is currently being requested (see the hypothetical check below).

This seems to be a timing problem and occurs erratically. Sorry, I could not try this on a newer cluster because it is so hard to reproduce. However, you can see in our AM's log where it asks for 2000MB and gets 1024MB:

2015-12-09 02:41:10 INFO net.redpoint.yarnapp.ApplicationMaster: TaskLauncher.run: ** STARTING CONTAINER ** task = Task['([...] containerRequest=Capability[]Priority[0], container=container_1446677679275_0312_01_000001, state=new, result=null, diagnostics='null', retries=0]
container = Container: [ContainerId: container_1446677679275_0312_01_000001, NodeId: rpb-cdh-kerb-2.office.datalever.com:8041, NodeHttpAddress: rpb-cdh-kerb-2.office.datalever.com:8042, Resource: , Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.58.41:8041 }, ]
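To make the missing step concrete, this is the kind of size guard that appears to be skipped when the scheduler hands a reserved container back for a new request. This is a hypothetical sketch, NOT actual ResourceManager code; the method and its callers are illustrative only:

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;

public final class ReservationCheck {
  /**
   * Hypothetical guard (illustrative, not actual RM scheduler code):
   * before a previously reserved container is returned to satisfy a
   * pending request, its capability should cover what is being asked
   * for now. In the scenario above, reserved is 1024MB (the stale AM
   * reservation) while asked is 2000MB, so this would return false.
   */
  static boolean reservationFits(Resource reserved, Resource asked) {
    return reserved.getMemory() >= asked.getMemory()
        && reserved.getVirtualCores() >= asked.getVirtualCores();
  }
}
{code}

In the failure above, a check like this is evidently not performed, so the 1024MB reservation is handed back against the 2000MB ask.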
This is probably clearer in the attached snippet of RM log, where you can see this happening with appid 312 (ignore 311, which is also in there). You can see that the RM reserves one container, launches the AM in a second, and then later returns the reserved container in response to a task container request of 2000MB, so the task comes up short.

This is relatively easy to work around (just reject that container and wait for another), which is why this is of minor importance. But it seems that YARN should give you the memory you requested, and it doesn't in this case. Perhaps this is "as designed", but it is certainly unexpected.
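For anyone hitting the same thing, here is a minimal sketch of that workaround, assuming an AMRMClientAsync-based AM; the names amRMClient, requestedCapability, and launchTask are illustrative assumptions, not our exact code:

{code:java}
import java.util.List;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

// Assumed fields on the enclosing CallbackHandler (illustrative):
//   AMRMClientAsync<ContainerRequest> amRMClient;  // the AM's async RM client
//   Resource requestedCapability;                  // e.g. 2000MB, 1 vcore

@Override
public void onContainersAllocated(List<Container> containers) {
  for (Container c : containers) {
    if (c.getResource().getMemory() < requestedCapability.getMemory()) {
      // Undersized container (e.g. the stale 1024MB AM reservation):
      // give it back to the RM and re-issue the original request.
      amRMClient.releaseAssignedContainer(c.getId());
      amRMClient.addContainerRequest(new ContainerRequest(
          requestedCapability, null, null, Priority.newInstance(0)));
    } else {
      launchTask(c);  // container is big enough; proceed normally
    }
  }
}
{code}

Releasing and re-requesting costs one extra scheduling round trip for the affected task, but it avoids launching in a container that will be killed for over-memory use.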