From: "Jason Lowe (JIRA)"
To: yarn-issues@hadoop.apache.org
Date: Fri, 21 Jul 2017 16:25:00 +0000 (UTC)
Subject: [jira] [Updated] (YARN-6846) Nodemanager can fail to fully delete application local directories when applications are killed

    [ https://issues.apache.org/jira/browse/YARN-6846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated YARN-6846:
-----------------------------
    Attachment: YARN-6846.001.patch

Attaching a patch that makes the container-executor more tolerant of paths that have already been deleted when removing a hierarchy. It also makes the deletion code best-effort: it attempts to delete the remaining entries even if unlinking one of them fails.
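For reference, a minimal sketch of the kind of tolerant, best-effort traversal the patch describes (illustrative only, not the actual container-executor code; the function and variable names are invented): ENOENT is treated as success, since a concurrent deletion task may have removed the path first, and an error on one entry does not stop the deletion of its siblings.

/*
 * Sketch only (not the YARN-6846 patch): best-effort recursive delete
 * that tolerates entries vanishing underneath it.
 */
#include <dirent.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static int delete_tree(const char *path) {
    int ret = 0;
    struct stat sb;

    if (lstat(path, &sb) != 0) {
        /* already gone: another deletion task beat us to it, not an error */
        return errno == ENOENT ? 0 : -1;
    }

    if (S_ISDIR(sb.st_mode)) {
        DIR *dir = opendir(path);
        if (dir == NULL) {
            return errno == ENOENT ? 0 : -1;
        }
        struct dirent *entry;
        while ((entry = readdir(dir)) != NULL) {
            if (strcmp(entry->d_name, ".") == 0 ||
                strcmp(entry->d_name, "..") == 0) {
                continue;
            }
            char child[PATH_MAX];
            snprintf(child, sizeof(child), "%s/%s", path, entry->d_name);
            /* best-effort: record the failure but keep deleting siblings */
            if (delete_tree(child) != 0) {
                ret = -1;
            }
        }
        closedir(dir);
        if (rmdir(path) != 0 && errno != ENOENT) {
            ret = -1;
        }
    } else {
        if (unlink(path) != 0 && errno != ENOENT) {
            ret = -1;
        }
    }
    return ret;
}

The key design point is that an entry vanishing between readdir() and unlink() is expected under the race described in the quoted report below, so ENOENT must not be treated as a failure.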
> Nodemanager can fail to fully delete application local directories when applications are killed
> -----------------------------------------------------------------------------------------------
>
>                 Key: YARN-6846
>                 URL: https://issues.apache.org/jira/browse/YARN-6846
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.8.1
>            Reporter: Jason Lowe
>            Priority: Critical
>         Attachments: YARN-6846.001.patch
>
>
> When an application is killed, all of its running containers are killed, and the app waits for the containers to complete before cleaning up. As each container completes, its container directory is deleted via the DeletionService. After all containers have completed, the app completes and the app directory is deleted. If the app completes quickly enough, the deletion of the container and app directories can race against each other. If the container deletion executor deletes a file just before the application deletion executor does, the application deletion executor can fail, leaving the remaining entries in the application directory behind.
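To make the failure mode concrete, here is a hypothetical reproduction of the race in the report above (the paths and directory layout are invented for illustration, and it reuses delete_tree() from the earlier sketch): two deletion tasks run concurrently over overlapping trees, just as the container-directory and app-directory cleanup tasks do. A strict deleter that aborts on the first unlink() failure can lose this race and leave siblings behind; the tolerant version keeps going.

#include <pthread.h>
#include <stdio.h>
#include <sys/stat.h>

extern int delete_tree(const char *path);  /* from the sketch above */

static void *delete_task(void *arg) {
    const char *path = arg;
    if (delete_task != NULL && delete_tree(path) != 0) {
        fprintf(stderr, "best-effort delete of %s hit errors\n", path);
    }
    return NULL;
}

int main(void) {
    /* hypothetical layout standing in for the NM local dirs */
    mkdir("/tmp/app_0001", 0755);
    mkdir("/tmp/app_0001/container_01", 0755);
    FILE *f = fopen("/tmp/app_0001/container_01/stdout", "w");
    if (f) fclose(f);

    /* container cleanup and app cleanup race, as in the report */
    pthread_t t1, t2;
    pthread_create(&t1, NULL, delete_task, "/tmp/app_0001/container_01");
    pthread_create(&t2, NULL, delete_task, "/tmp/app_0001");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

Compile both files together (e.g. cc race.c delete_tree.c -lpthread); with the ENOENT-tolerant delete, both tasks finish cleanly regardless of which one unlinks a given entry first.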