Date: Fri, 12 Jan 2018 22:58:00 +0000 (UTC)
From: "Wangda Tan (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-7739) Revisit scheduler resource normalization behavior for max allocation

    [ https://issues.apache.org/jira/browse/YARN-7739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324699#comment-16324699 ]

Wangda Tan commented on YARN-7739:
----------------------------------

Thanks [~jlowe], to me it is also a bug :). I think we should get rid of this behavior, since it could badly impact users when multiple resources are enabled.
Will talk to Vinod and keep this thread updated.

> Revisit scheduler resource normalization behavior for max allocation
> --------------------------------------------------------------------
>
>                 Key: YARN-7739
>                 URL: https://issues.apache.org/jira/browse/YARN-7739
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Wangda Tan
>            Priority: Critical
>
> Currently, the YARN scheduler normalizes each requested resource against a maximum allocation derived from the configured maximum allocation and the maximum registered node resources. In effect, the scheduler silently caps the requested resource at the maximum allocation.
> This can cause problems for applications. For example, a Spark job may need 12 GB of memory to run, but the registered NMs in the cluster have at most 8 GB of memory per node, so the scheduler allocates an 8 GB container to the application.
> Once the app receives containers from the RM, if it doesn't double-check the allocated resources, it will hit OOMs that are hard to debug, because the scheduler capped the allocation silently.
> When non-mandatory resources are introduced, this becomes worse. For resources like GPUs, we typically set the minimum allocation to 0 since not all nodes have GPU devices. So it is possible that an application asks for 4 GPUs but gets 0 GPUs, which is a big problem.
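To make the failure mode concrete, below is a minimal, self-contained Java sketch of the silent-cap behavior described above. It is not YARN's actual normalization code (in the RM that logic lives around SchedulerUtils.normalizeRequest and the ResourceCalculator); the Resource record and its field names here are hypothetical, chosen only to mirror the 12 GB / 8 GB and 4-GPU / 0-GPU examples from the description:

    // Sketch only -- NOT YARN's implementation. Illustrates the silent cap:
    // each requested resource component is clamped to the effective maximum
    // allocation, and the caller is never told the ask was reduced.
    public final class NormalizationSketch {

        // Hypothetical resource vector: memory in MB, vcores, GPUs.
        record Resource(long memoryMb, int vcores, int gpus) {}

        static Resource normalize(Resource ask, Resource max) {
            // The cap is silent: no exception, no warning, just a
            // smaller resource than the application asked for.
            return new Resource(
                Math.min(ask.memoryMb(), max.memoryMb()),
                Math.min(ask.vcores(), max.vcores()),
                Math.min(ask.gpus(), max.gpus()));
        }

        public static void main(String[] args) {
            // From the issue: app asks for 12 GB and 4 GPUs, but registered
            // nodes offer at most 8 GB and no node has GPUs.
            Resource ask = new Resource(12 * 1024, 4, 4);
            Resource max = new Resource(8 * 1024, 8, 0);
            System.out.println(normalize(ask, max));
            // Prints: Resource[memoryMb=8192, vcores=4, gpus=0]
        }
    }

The point of the sketch is the output: the application asked for 12 GB and 4 GPUs, silently received 8 GB and 0 GPUs, and nothing in the returned container signals that the request was reduced, which is exactly why the app later OOMs or finds no GPU at runtime.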