Date: Fri, 26 Jun 2015 01:26:05 +0000 (UTC)
From: "Varun Saxena (JIRA)"
To: yarn-issues@hadoop.apache.org
Reply-To: yarn-issues@hadoop.apache.org
Subject: [jira] [Updated] (YARN-3850) NM fails to read files from full disks which can lead to container logs being lost and other issues

     [ https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Varun Saxena updated YARN-3850:
-------------------------------
    Description:
*Container logs* can be lost if a disk has become full (~90% full).
When an application finishes, we upload logs after aggregation by calling {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn checks the eligible directories by calling {{LocalDirsHandlerService#getLogDirs}}, which would return nothing when the disk is full. So none of the container logs are aggregated and uploaded.
But on application finish, we also call {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}. This deletes the application directory, which contains the container logs, because it calls {{LocalDirsHandlerService#getLogDirsForCleanup}}, which returns the full disks as well.
So we are left with neither the aggregated logs for the app nor its individual container logs.
In addition to this, there are 2 more issues:
# {{ContainerLogsUtil#getContainerLogDirs}} does not consider full disks, so the NM will fail to serve logs from full disks through its web interfaces.
# {{RecoveredContainerLaunch#locatePidFile}} also does not consider full disks, so it is possible that the PID file is not found on container recovery.

  was:
*Container logs* can be lost if disk has become bad(become 90% full).
When application finishes, we upload logs after aggregation by calling {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turns checks the eligible directories on call to {{LocalDirsHandlerService#getLogDirs}} which in case of disk full would return nothing. So none of the container logs are aggregated and uploaded.
But on application finish, we also call {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}. This deletes the application directory which contains container logs. This is because it calls {{LocalDirsHandlerService#getLogDirsForCleanup}} which returns the full disks as well.
So we are left with neither aggregated logs for the app nor the individual container logs for the app.
In addition to this, there are 2 more issues :
# {{ContainerLogsUtil#getContainerLogDirs}} does not consider full disks so NM will fail to serve up logs from full disks from its web interfaces.
# {{RecoveredContainerLaunch#locatePidFile}} also does not consider full disks so it is possible that on container recovery, PID file is not found.

> NM fails to read files from full disks which can lead to container logs being lost and other issues
> ---------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3850
>                 URL: https://issues.apache.org/jira/browse/YARN-3850
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: log-aggregation, nodemanager
>    Affects Versions: 2.7.0
>            Reporter: Varun Saxena
>            Assignee: Varun Saxena
>            Priority: Blocker
>         Attachments: YARN-3850.01.patch, YARN-3850.02.patch
>
>
> *Container logs* can be lost if a disk has become full (~90% full).
> When an application finishes, we upload logs after aggregation by calling {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn checks the eligible directories by calling {{LocalDirsHandlerService#getLogDirs}}, which would return nothing when the disk is full. So none of the container logs are aggregated and uploaded.
> But on application finish, we also call {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}. This deletes the application directory, which contains the container logs, because it calls {{LocalDirsHandlerService#getLogDirsForCleanup}}, which returns the full disks as well.
> So we are left with neither the aggregated logs for the app nor its individual container logs.
> In addition to this, there are 2 more issues:
> # {{ContainerLogsUtil#getContainerLogDirs}} does not consider full disks, so the NM will fail to serve logs from full disks through its web interfaces.
> # {{RecoveredContainerLaunch#locatePidFile}} also does not consider full disks, so it is possible that the PID file is not found on container recovery.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
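
For illustration only, the mismatch between the two directory lookups described in the issue can be modelled with a minimal, self-contained sketch. This is not the NodeManager code: the classes {{LogDir}} and {{DirsHandler}}, the helper {{deletedWithoutUpload}}, and the paths are hypothetical stand-ins, and the model assumes a directory is either full or usable, with no other failure states. It only shows why a directory that is skipped for upload but included in cleanup loses its logs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FullDiskLogLossSketch {

    /** Hypothetical stand-in for one NM log directory and its disk state. */
    static final class LogDir {
        final String path;
        final boolean full;

        LogDir(String path, boolean full) {
            this.path = path;
            this.full = full;
        }
    }

    /** Simplified model of the two lookups named in the report (not the real class). */
    static final class DirsHandler {
        private final List<LogDir> dirs;

        DirsHandler(List<LogDir> dirs) {
            this.dirs = dirs;
        }

        /** Models the read path: directories on full disks are left out. */
        List<String> getLogDirs() {
            List<String> result = new ArrayList<>();
            for (LogDir d : dirs) {
                if (!d.full) {
                    result.add(d.path);
                }
            }
            return result;
        }

        /** Models the cleanup path: directories on full disks are included. */
        List<String> getLogDirsForCleanup() {
            List<String> result = new ArrayList<>();
            for (LogDir d : dirs) {
                result.add(d.path);
            }
            return result;
        }
    }

    /** Directories that cleanup deletes although the upload step never read them. */
    static List<String> deletedWithoutUpload(DirsHandler handler) {
        List<String> lost = new ArrayList<>(handler.getLogDirsForCleanup());
        lost.removeAll(handler.getLogDirs());
        return lost;
    }

    public static void main(String[] args) {
        // Hypothetical layout: one full disk, one healthy disk.
        DirsHandler handler = new DirsHandler(Arrays.asList(
                new LogDir("/grid/0/logs", true),
                new LogDir("/grid/1/logs", false)));

        System.out.println("read for upload:        " + handler.getLogDirs());
        System.out.println("deleted on cleanup:     " + handler.getLogDirsForCleanup());
        System.out.println("deleted without upload: " + deletedWithoutUpload(handler));
    }
}
```

In this model, any directory reported by {{deletedWithoutUpload}} corresponds to container logs that are neither aggregated nor kept locally. If every disk is full, the read path returns an empty list while cleanup still covers all directories, which matches the "neither aggregated logs nor individual container logs" outcome in the description.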