Date: Fri, 9 Oct 2015 20:11:06 +0000 (UTC)
From: "Jason Lowe (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state

    [ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14951115#comment-14951115 ]

Jason Lowe commented on YARN-2902:
----------------------------------

The key with the new proposal is that the LocalizerRunner thread is the one issuing the deletes, and only after the localizer process exits.

bq. 2. Container is killed. Associated resources are stuck in downloading state and a deletion task is launched for them.

That deletion task would not be launched because the localizer has not exited.
The LocalizerRunner will still be waiting on the localizer process to exit.

bq. Localizer doesn't exit immediately as of now when container is killed, even though we interrupt the thread.

Yes, and that's fine. We won't issue the deletion requests until the localizer process eventually exits. The key is this code in LocalizerRunner:

{code:title=LocalizerRunner}
public void run() {
  Path nmPrivateCTokensPath = null;
  Throwable exception = null;
  try {
    [...localizer pre-startup code removed for brevity...]
    if (dirsHandler.areDisksHealthy()) {
      exec.startLocalizer(new LocalizerStartContext.Builder()
          .setNmPrivateContainerTokens(nmPrivateCTokensPath)
          .setNmAddr(localizationServerAddress)
          .setUser(context.getUser())
          .setAppId(ConverterUtils.toString(context.getContainerId()
              .getApplicationAttemptId().getApplicationId()))
          .setLocId(localizerId)
          .setDirsHandler(dirsHandler)
          .build());
    } else {
      throw new IOException("All disks failed. "
          + dirsHandler.getDisksHealthReport(false));
    }
  // TODO handle ExitCodeException separately?
  } catch (FSError fe) {
    exception = fe;
  } catch (Exception e) {
    exception = e;
  } finally {
    if (exception != null) {
      LOG.info("Localizer failed", exception);
      // On error, report failure to Container and signal ABORT
      // Notify resource of failed localization
      ContainerId cId = context.getContainerId();
      dispatcher.getEventHandler().handle(new ContainerResourceFailedEvent(
          cId, null, exception.getMessage()));
    }
    for (LocalizerResourceRequestEvent event : scheduled.values()) {
      event.getResource().unlock();
    }
    delService.delete(null, nmPrivateCTokensPath, new Path[] {});
  }
}
{code}

startLocalizer won't return until the localizer process exits, so when the finally block iterates the {{scheduled}} map to unlock the resources, we can issue deletions for the local resource paths at the same time.
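To make the ordering concrete, here is a minimal, self-contained sketch of the idea, not the actual patch. PendingResource, RecordingDeletionService and runLocalizer are hypothetical stand-ins invented for illustration; the real code would use LocalizerResourceRequestEvent, the NodeManager's DeletionService and exec.startLocalizer. It also assumes the {{scheduled}} map holds only the resources that had not finished downloading when the localizer exited.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LocalizerCleanupSketch {

  /** Hypothetical stand-in for a resource still in the DOWNLOADING state. */
  static final class PendingResource {
    final String localPath;
    boolean locked = true;

    PendingResource(String localPath) {
      this.localPath = localPath;
    }

    void unlock() {
      locked = false;
    }
  }

  /** Hypothetical stand-in for the deletion service; it only records requests. */
  static final class RecordingDeletionService {
    final List<String> requested = new ArrayList<>();

    void delete(String path) {
      requested.add(path);
    }
  }

  /**
   * Mirrors the shape of LocalizerRunner.run(): block until the localizer
   * exits, then unlock and request deletion from the finally block.
   */
  static void runLocalizer(Runnable blockingLocalizer,
      Map<String, PendingResource> scheduled,
      RecordingDeletionService delService) {
    try {
      // Stands in for exec.startLocalizer(...): returns or throws only
      // once the localizer process has exited.
      blockingLocalizer.run();
    } catch (RuntimeException e) {
      System.out.println("Localizer failed: " + e.getMessage());
    } finally {
      // The localizer is gone by now, so nothing can still be writing
      // to these paths when the deletion is requested.
      for (PendingResource rsrc : scheduled.values()) {
        rsrc.unlock();
        delService.delete(rsrc.localPath);
      }
    }
  }

  public static void main(String[] args) {
    Map<String, PendingResource> scheduled = new LinkedHashMap<>();
    scheduled.put("job.jar", new PendingResource("/tmp/nm-local/filecache/10"));
    scheduled.put("conf.xml", new PendingResource("/tmp/nm-local/filecache/11"));
    RecordingDeletionService delService = new RecordingDeletionService();

    // Simulate a localizer that is killed mid-download and exits with an error.
    runLocalizer(() -> {
      throw new IllegalStateException("localizer killed");
    }, scheduled, delService);

    System.out.println("Deletions requested: " + delService.requested);
  }
}
```

Because the deletions are requested from the same thread that waits on the localizer, no extra synchronization is needed to know the process has stopped writing. A localizer that is slow to exit after the kill only delays the cleanup; it cannot race with it.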
> Killing a container that is localizing can orphan resources in the DOWNLOADING state
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-2902
>                 URL: https://issues.apache.org/jira/browse/YARN-2902
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-2902.002.patch, YARN-2902.03.patch, YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, YARN-2902.07.patch, YARN-2902.patch
>
>
> If a container is in the process of localizing when it is stopped/killed, then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources, they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans, since those scans never delete resources in the DOWNLOADING state even if their reference count is zero.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)