hadoop-yarn-issues mailing list archives

From "Lavkesh Lahngir (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure
Date Tue, 30 Jun 2015 10:32:05 GMT

    [ https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14608089#comment-14608089 ]

Lavkesh Lahngir commented on YARN-3591:

Thanks [~jlowe] and [~zxu] for the detailed analysis and reviews.

Honestly, this has become more involved than I thought.
A few comments:
1. I wrote a sample program to measure the time penalty we would incur. File.exists()
along with listing the parent directory (the initial patch) adds virtually nothing: the combined
time for both calls is around 0.1 ms. (We have applied this patch in production.) It will just
remove the entry from the map, which will not affect running containers. This solves the
problem of new containers failing.
2. The latest patch, which checks whether the resource path exists on one of the good disks
(basically some string comparison), has major performance implications. It takes around 40 ms.
There is no way we could afford that.
3. If the file does not exist, or it is localized on a bad disk, we need to keep track of
those entries as well so that they can be removed from the disk, as suggested in Jason's
comment. We can't blindly delete from the disk if the refcount is greater than one.
Can we logically separate the original problem from the related problem of zombie files and
address them in separate JIRAs?
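
For reference, a minimal sketch of the kind of check and timing described in point 1. This is not the patch itself: the class name LocalResourceCheck and the helper below are hypothetical, the temp-directory setup only stands in for a localized resource, and treating an empty listing as a failure is an assumption made for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class LocalResourceCheck {

    // Hypothetical helper, not the actual NodeManager method.
    // exists() can succeed from cached inode data, so the parent directory
    // is also listed, which forces the directory to be opened.
    static boolean isResourcePresent(File resource) {
        if (!resource.exists()) {
            return false;
        }
        File parent = resource.getParentFile();
        if (parent == null) {
            return false;
        }
        // File.list() returns null if the path is not a directory
        // or if an I/O error occurs while reading it.
        String[] children = parent.list();
        // Assumption for this sketch: an empty listing also counts as a failure.
        return children != null && children.length > 0;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a localized resource on a healthy disk.
        Path dir = Files.createTempDirectory("localized");
        File resource = Files.createFile(dir.resolve("resource.jar")).toFile();

        long start = System.nanoTime();
        boolean present = isResourcePresent(resource);
        long elapsedMicros = (System.nanoTime() - start) / 1_000;
        System.out.println("present=" + present + ", elapsed(us)=" + elapsedMicros);

        // Clean up the temporary file and directory.
        resource.delete();
        dir.toFile().delete();
    }
}
```

A single timed call like this is only a rough indication; averaging over many iterations on the real local-dirs would give a number comparable to the figures above.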

> Resource Localisation on a bad disk causes subsequent containers failure 
> -------------------------------------------------------------------------
>                 Key: YARN-3591
>                 URL: https://issues.apache.org/jira/browse/YARN-3591
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Lavkesh Lahngir
>            Assignee: Lavkesh Lahngir
>         Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, YARN-3591.2.patch,
YARN-3591.3.patch, YARN-3591.4.patch, YARN-3591.5.patch
> It happens when a resource is localised on a disk and that disk goes bad after localisation.
The NM keeps paths of localised resources in memory. At the time of a resource request, isResourcePresent(rsrc)
will be called, which calls file.exists() on the localised path.
> In some cases when the disk has gone bad, inodes are still cached and file.exists() returns
true. But at the time of reading, the file will not open.
> Note: file.exists() actually calls stat64 natively, which returns true because it was
able to find inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which will call
open() natively. If the disk is good it should return an array of paths with length at-least

This message was sent by Atlassian JIRA
