hadoop-yarn-issues mailing list archives

From "Varun Vasudev (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure
Date Wed, 02 Sep 2015 17:38:47 GMT

    [ https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14727698#comment-14727698
] 

Varun Vasudev commented on YARN-3591:
-------------------------------------

Thanks for the latest patch, Lavkesh! A couple of comments:
1.
Instead of 
{code}
+    this.dirsHandler = dirHandler;
{code}
in the new constructors you added, can you add that line to
{code}
LocalResourcesTrackerImpl(String user, ApplicationId appId,
      Dispatcher dispatcher,
      ConcurrentMap<LocalResourceRequest,LocalizedResource> localrsrc,
      boolean useLocalCacheDirectoryManager, Configuration conf,
      NMStateStoreService stateStore)
{code}
and have the other constructors call this one? Pass null for the directory handler if the
existing constructors are called.
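
One reading of the suggestion above, as a minimal sketch: the fullest constructor gains a directory-handler parameter and is the only place that assigns the field, while the older constructors delegate to it with null. `TrackerSketch` and `DirsHandlerStub` are hypothetical stand-ins, not the real LocalResourcesTrackerImpl or its Hadoop parameter types.

```java
// Hypothetical stand-in for the directory handler type used by the patch.
class DirsHandlerStub {}

// Simplified sketch; the real class also takes an ApplicationId, Dispatcher,
// ConcurrentMap, Configuration and NMStateStoreService.
class TrackerSketch {
    private final String user;
    private final boolean useLocalCacheDirectoryManager;
    private final DirsHandlerStub dirsHandler; // null when no handler was supplied

    // Existing constructor: delegates and passes null for the handler.
    TrackerSketch(String user, boolean useLocalCacheDirectoryManager) {
        this(user, useLocalCacheDirectoryManager, null);
    }

    // The one constructor that assigns every field, including dirsHandler.
    TrackerSketch(String user, boolean useLocalCacheDirectoryManager,
                  DirsHandlerStub dirsHandler) {
        this.user = user;
        this.useLocalCacheDirectoryManager = useLocalCacheDirectoryManager;
        this.dirsHandler = dirsHandler;
    }

    boolean hasDirsHandler() {
        return dirsHandler != null;
    }
}
```

With this shape the assignment lives in one place, so a constructor added later cannot forget it; callers of the field need a null check.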

2.
{code}
+      ret |= isParent(rsrc.getLocalPath().toUri().getPath(), dir);
{code}
We don't need to iterate through all the local dirs. Once ret is true we can break the loop
and return.
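
A minimal sketch of the early exit described above. The `isParent` body here is an assumption made for illustration (a plain path-prefix test); the helper in the patch may be implemented differently, and `ParentCheck` and `isUnderAnyDir` are hypothetical names.

```java
import java.util.List;

class ParentCheck {
    // Hypothetical stand-in for the patch's isParent helper:
    // true when 'path' lies somewhere below the directory 'parent'.
    static boolean isParent(String path, String parent) {
        String prefix = parent.endsWith("/") ? parent : parent + "/";
        return path.startsWith(prefix);
    }

    // Returns as soon as one local dir matches instead of OR-ing
    // the result across every remaining directory.
    static boolean isUnderAnyDir(String path, List<String> localDirs) {
        for (String dir : localDirs) {
            if (isParent(path, dir)) {
                return true;
            }
        }
        return false;
    }
}
```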

Rest of the patch looks good.

> Resource Localisation on a bad disk causes subsequent containers failure 
> -------------------------------------------------------------------------
>
>                 Key: YARN-3591
>                 URL: https://issues.apache.org/jira/browse/YARN-3591
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Lavkesh Lahngir
>            Assignee: Lavkesh Lahngir
>         Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, YARN-3591.2.patch,
YARN-3591.3.patch, YARN-3591.4.patch, YARN-3591.5.patch, YARN-3591.6.patch, YARN-3591.7.patch,
YARN-3591.8.patch
>
>
> This happens when a resource is localised on a disk and that disk goes bad afterwards. The NM
keeps the paths of localised resources in memory. At the time of a resource request, isResourcePresent(rsrc)
is called, which calls file.exists() on the localised path.
> In some cases, when the disk has gone bad, inodes are still cached and file.exists() returns
true, but the file cannot be opened when it is read.
> Note: file.exists() actually calls stat64 natively, which succeeds because it was
able to find the inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which will call
open() natively. If the disk is good, it should return an array of paths with length at least
1.
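
The proposal in the issue description quoted above could look roughly like the following sketch. `LocalizedResourceCheck` and `parentIsListable` are hypothetical names, not code from any attached patch. The sketch relies only on the documented behaviour that `File.list()` returns null when the path is not a directory or an I/O error occurs.

```java
import java.io.File;

class LocalizedResourceCheck {
    // Lists the parent directory of the resource rather than trusting
    // File.exists() alone. A healthy disk should yield at least one entry,
    // because the resource itself lives in that directory.
    static boolean parentIsListable(File resource) {
        File parent = resource.getParentFile();
        if (parent == null) {
            return false; // no parent component in the path
        }
        String[] entries = parent.list(); // null: not a directory, or I/O error
        return entries != null && entries.length >= 1;
    }
}
```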



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
