hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-6708) Nodemanager container crash after ext3 folder limit
Date Wed, 28 Jun 2017 15:41:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066725#comment-16066725

Jason Lowe commented on YARN-6708:

bq. If Node manager service and users are in different group should be able to check the availability
of existing cache folders during download and recovery.

Right, I was confusing this with HDFS group behavior where we'd setup the parent directory
to ensure the group of child directories is always one the NM participates in.  As you say,
we need it to be more permissive (at least 0711) so the NM will always be able to stat the
localized resource.  0755 should be fine, the parent directory has already locked the filecache
down to 710 anyway so only the user and the NM's group can get in.

bq. Already existing FSDownload code handles this case.

What I meant by that comment is it's weird that we _sometimes_ create the destDirPath leaf
directory before calling FSDownload.  If the parent is the local cache root we do not and
leave that for FSDownload to do, but if the parent isn't the root then we do create it.  We
should either always create it or never create it.  For example, I think the code should be
more like the following.  Also note the simplified logic where the whole path object is pushed
onto the stack which saves having to manipulate the path when we pop things off the stack:
      Path parent = destDirPath.getParent();
      Path cacheRoot = LocalCacheDirectoryManager.getCacheDirectoryRoot(parent);
      Stack<String> dirs = new Stack<String>();
      while (parent != null && !parent.equals(cacheRoot)) {
        parent = parent.getParent();
      // Create sub directories with same permission
      while (!dirs.isEmpty()) {
        createDir(lfs, dirs.pop(), USER_CACHE_FOLDER_PERMS);

That way we always leave it to FSDownload to create the leaf directory.  If we'd rather always
create it before calling FSDownload then we simply push destDirPath on the stack before the

bq. FSDownload should have been in nodemanager since its tightly coupled to the directoy permission
wrt to localization

With respect to permissions on that directory, yes, although those directory permissions are
a typical default and not necessarily specific to localization.  It's currently used by code
outside of the nodemanager and therefore a bit tricky to move without breaking things that
may be using it.

bq. FSDownload handles the final cache directory permissions. Even if 0/0/85 is created before
download, in FSDownload for 85 the same could get reset rt?? 

FSDownload will reset an existing cache directory permissions to 0755 iff the umask is too
restrictive to allow 0755 by default.  Either way the directory is going to be 0755 after
FSDownload is done with it, and that's all we need.

bq. The directory permission is 755 and in jenkins the umask is 022 to validate directory
rights for code change used reflection. Container localizer USERCACHE permission could be
package private but the above point of FSDownload will set the rights to 0755 or we should
be checking only 0/0??

We should only be checking 0/0.  We already know the leaf directory will be 0755 because FSDownload
does that for us.  The unit test can go ahead and verify 0/0/85 is the right permissions,
but it will be the case even before the patch.  Still probably worth it in case somehow they're
not correct.  What's critical is to test the permissions of 0/ and 0/0/.

As far as the umask is concerned, you should be able to override it by passing a FileContext
to the ContainerLocalizer where we've explicitly called setUMask on it to the desired umask
you need for the test.  We should not modify constants in classes to execute this test, especially
via reflection.  That's very confusing and could break other tests in the same test suite
that run afterwards since the classes have been scribbled upon.

Please also address the checkstyle comments in the new patch.  I'm guessing many of them will
become moot as part of the unit test rewrite.

> Nodemanager container crash after ext3 folder limit
> ---------------------------------------------------
>                 Key: YARN-6708
>                 URL: https://issues.apache.org/jira/browse/YARN-6708
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>            Priority: Critical
>         Attachments: YARN-6708.001.patch, YARN-6708.002.patch, YARN-6708.003.patch, YARN-6708.004.patch
> Configure umask as *027* for nodemanager service user
> and {{yarn.nodemanager.local-cache.max-files-per-directory}} as {{40}}. After 4  *private*
dir localization next directory will be *0/14*
> Local Directory cache manager 
> {code}
> vm2:/opt/hadoop/release/data/nmlocal/usercache/mapred/filecache # l
> total 28
> drwx--x--- 7 mapred hadoop 4096 Jun 10 14:35 ./
> drwxr-s--- 4 mapred hadoop 4096 Jun 10 12:07 ../
> drwxr-x--- 3 mapred users  4096 Jun 10 14:36 0/
> drwxr-xr-x 3 mapred users  4096 Jun 10 12:15 10/
> drwxr-xr-x 3 mapred users  4096 Jun 10 12:22 11/
> drwxr-xr-x 3 mapred users  4096 Jun 10 12:27 12/
> drwxr-xr-x 3 mapred users  4096 Jun 10 12:31 13/
> {code}
> *drwxr-x---* 3 mapred users  4096 Jun 10 14:36 0/ is only *750*
> Nodemanager user will not be able check for localization path exists or not.
> {{LocalResourcesTrackerImpl}}
> {code}
>     case REQUEST:
>       if (rsrc != null && (!isResourcePresent(rsrc))) {
>         LOG.info("Resource " + rsrc.getLocalPath()
>             + " is missing, localizing it again");
>         removeResource(req);
>         rsrc = null;
>       }
>       if (null == rsrc) {
>         rsrc = new LocalizedResource(req, dispatcher);
>         localrsrc.put(req, rsrc);
>       }
>       break;
> {code}
> *isResourcePresent* will always return false and same resource will be localized to {{0}}
to next unique number

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

View raw message