Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: yarn-issues@hadoop.apache.org
Date: Sat, 16 May 2015 06:30:00 +0000 (UTC)
From: "zhihai xu (JIRA)" <jira@apache.org>
To: yarn-issues@hadoop.apache.org
Message-ID: <JIRA.12828009.1431003114000.133931.1431757800598@Atlassian.JIRA>
In-Reply-To: <JIRA.12828009.1431003114000@Atlassian.JIRA>
References: <JIRA.12828009.1431003114000@Atlassian.JIRA>
 <JIRA.12828009.1431003114360@arcas>
Subject: [jira] [Commented] (YARN-3591) Resource Localisation on a bad disk
 causes subsequent containers failure
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546590#comment-14546590 ] 

zhihai xu commented on YARN-3591:
---------------------------------

[~lavkesh], currently DirectoryCollection tracks {{fullDirs}} and {{errorDirs}}. Neither list holds good dirs. IMO {{fullDirs}} are disks that can become good again once the localized files are deleted by the cache clean-up mentioned above, while {{errorDirs}} are corrupted disks that can't become good until somebody fixes them manually. Calling removeResource for localized resources in {{errorDirs}} sounds reasonable to me.

> Resource Localisation on a bad disk causes subsequent containers failure 
> -------------------------------------------------------------------------
>
>                 Key: YARN-3591
>                 URL: https://issues.apache.org/jira/browse/YARN-3591
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Lavkesh Lahngir
>         Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, YARN-3591.2.patch
>
>
> This happens when a resource is localised on a disk and that disk goes bad after localisation. NM keeps the paths of localised resources in memory. At the time of a resource request, isResourcePresent(rsrc) is called, which calls file.exists() on the localised path.
> In some cases when the disk has gone bad, inodes are still cached and file.exists() returns true, but the file cannot be opened when it is read.
> Note: file.exists() actually calls stat64 natively, which returns true because it was able to find inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which will call open() natively. If the disk is good, it should return an array of paths with length at least 1.
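
A minimal Java sketch of the proposed check, for illustration only. The class name LocalizedResourceCheck and the method isReadablyPresent are hypothetical and are not taken from the attached patches or from the NodeManager code. It assumes, as the description states, that listing the parent directory forces a real directory read, whereas file.exists() can be answered from cached inode data.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class LocalizedResourceCheck {

    // Hypothetical helper: treats a resource as present only if its parent
    // directory can actually be listed and contains at least one entry.
    static boolean isReadablyPresent(File resource) {
        File parent = resource.getParentFile();
        if (parent == null) {
            return false;
        }
        // File.list() returns null if the path is not a directory
        // or if an I/O error occurs while reading it.
        String[] entries = parent.list();
        if (entries == null || entries.length == 0) {
            return false;
        }
        return resource.exists();
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("localized").toFile();
        File rsrc = new File(dir, "job.jar");

        // Parent directory is empty, so the check fails.
        System.out.println(isReadablyPresent(rsrc)); // prints "false"

        Files.createFile(rsrc.toPath());
        // Parent directory is listable and the file exists.
        System.out.println(isReadablyPresent(rsrc)); // prints "true"

        rsrc.delete();
        dir.delete();
    }
}
```

Whether a listing failure should also trigger removeResource for the affected path is a separate design decision and is not shown here.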

