Date: Mon, 3 Aug 2015 21:00:05 +0000 (UTC)
From: "Jason Lowe (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Resolved] (YARN-4011) Jobs fail since nm-local-dir not cleaned up when rogue job fills up disk

[ https://issues.apache.org/jira/browse/YARN-4011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe resolved YARN-4011.
------------------------------
    Resolution: Duplicate

Yes, we ran into this a while ago; see YARN-2473, which was ultimately fixed by YARN-90. There were a number of other fixes as well, such as YARN-3850 and YARN-3925. Closing this as a duplicate of YARN-90.

As for the ability to limit disk space, this has been discussed many times before, as far back as MAPREDUCE-1100.
An unsophisticated solution is to have the container monitor that is already watching for excessive memory usage also monitor disk usage and kill the container if its footprint grows too large. However, this doesn't solve the problem for containers that write to locations that aren't container-specific (e.g., where maps store their shuffle outputs, /tmp, etc.). I think it could be difficult to enforce limits on tasks that fill the disk in arbitrary ways, but it could be straightforward to catch a task that is simply logging too much.

> Jobs fail since nm-local-dir not cleaned up when rogue job fills up disk
> ------------------------------------------------------------------------
>
>                 Key: YARN-4011
>                 URL: https://issues.apache.org/jira/browse/YARN-4011
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 2.4.0
>            Reporter: Ashwin Shankar
>
> We observed jobs failing because tasks couldn't launch on nodes due to "java.io.IOException: No space left on device".
> On digging in further, we found a rogue job which had filled up the disk.
> Specifically, it wrote a lot of map spills (like attempt_1432082376223_461647_m_000421_0_spill_10000.out) to nm-local-dir, causing the disk to fill up, and when it failed/got killed, it didn't clean up these files in nm-local-dir.
> So the disk remained full, causing subsequent jobs to fail.
> This jira was created to address why files under nm-local-dir don't get cleaned up when a job fails after filling up the disk.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
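As an aside, the per-container disk check described in the comment above could be sketched roughly as follows. This is a hypothetical illustration, not actual YARN code: the class name, method names, and byte limit are all made up for the example, and a real monitor would run the check periodically alongside the existing memory check.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Stream;

// Hypothetical sketch of a per-container disk-usage check, in the spirit of
// the memory check the container monitor already performs. Illustrative only.
public class ContainerDiskCheck {

    // Recursively sum the sizes of all regular files under the container's
    // local working directory.
    static long directorySizeBytes(Path dir) throws IOException {
        final AtomicLong total = new AtomicLong();
        try (Stream<Path> files = Files.walk(dir)) {
            files.filter(Files::isRegularFile).forEach(p -> {
                try {
                    total.addAndGet(Files.size(p));
                } catch (IOException e) {
                    // A spill file may vanish while the task runs; skip it.
                }
            });
        }
        return total.get();
    }

    // True if the container should be killed for exceeding its
    // (hypothetical) per-container disk allocation.
    static boolean exceedsDiskLimit(Path containerDir, long limitBytes)
            throws IOException {
        return directorySizeBytes(containerDir) > limitBytes;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("container-local");
        Files.write(dir.resolve("spill_0.out"), new byte[4096]);
        System.out.println(exceedsDiskLimit(dir, 1024)); // prints true
    }
}
```

As the comment notes, a check like this only covers container-specific directories; writes to shared locations such as /tmp or the shuffle output area would escape it.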