hadoop-mapreduce-issues mailing list archives

From "Hudson (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3728) ShuffleHandler can't access results when configured in a secure mode
Date Thu, 01 Mar 2012 12:42:02 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219993#comment-13219993 ]

Hudson commented on MAPREDUCE-3728:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk #971 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/971/])
    MAPREDUCE-3728. ShuffleHandler can't access results when configured in a secure mode (ahmed via tucu) (Revision 1295245)

     Result = SUCCESS
tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295245
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestContainerLocalizer.java

                
> ShuffleHandler can't access results when configured in a secure mode
> --------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3728
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3728
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.0
>            Reporter: Roman Shaposhnik
>            Assignee: Ahmed Radwan
>            Priority: Critical
>             Fix For: 0.23.3
>
>         Attachments: MAPREDUCE-3728.patch
>
>
> While running the simplest of jobs (Pi) on MR2 in a fully secure configuration, I noticed
> that the job was failing on the reduce side, with the following messages littering
> the nodemanager logs:
> {noformat}
> 2012-01-19 08:35:32,544 ERROR org.apache.hadoop.mapred.ShuffleHandler: Shuffle error
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find usercache/rvs/appcache/application_1326928483038_0001/output/attempt_1326928483038_0001_m_000003_0/file.out.index in any of the configured local directories
> {noformat}
> While digging further, I found that the permissions on the files and directories were
> preventing the nodemanager (running as the user yarn) from accessing these files:
> {noformat}
> $ ls -l /data/3/yarn/usercache/testuser/appcache/application_1327102703969_0001/output/attempt_1327102703969_0001_m_000001_0
> -rw-r----- 1 testuser testuser 28 Jan 20 15:41 file.out
> -rw-r----- 1 testuser testuser 32 Jan 20 15:41 file.out.index
> {noformat}
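To make the listing above concrete, here is a minimal sketch of the classic Unix read-permission check applied to mode 0640 (-rw-r-----). It is a simplified model only (no ACLs, no root override), not Hadoop code; the class and method names are hypothetical, and the assumption that the nodemanager user yarn belongs only to the group yarn is taken from the description in this report.

```java
import java.util.Set;

// Simplified model of the owner/group/other read check (hypothetical helper, not a Hadoop API).
final class ModeCheck {
    static final int S_IRUSR = 0400; // owner read
    static final int S_IRGRP = 0040; // group read
    static final int S_IROTH = 0004; // other read

    // Returns true if `user` (member of `userGroups`) may read a file with the given mode and ownership.
    static boolean canRead(int mode, String owner, String group,
                           String user, Set<String> userGroups) {
        if (user.equals(owner)) {
            return (mode & S_IRUSR) != 0;
        }
        if (userGroups.contains(group)) {
            return (mode & S_IRGRP) != 0;
        }
        return (mode & S_IROTH) != 0;
    }

    public static void main(String[] args) {
        int mode = 0640; // -rw-r-----, as in the listing
        // File owned by testuser:testuser, read by yarn (group yarn): falls through to "other".
        System.out.println(canRead(mode, "testuser", "testuser", "yarn", Set.of("yarn"))); // false
        // Same mode, but group ownership yarn: the group-read bit applies.
        System.out.println(canRead(mode, "testuser", "yarn", "yarn", Set.of("yarn")));     // true
    }
}
```

Under this model the mode bits themselves are fine for group access; what blocks the nodemanager is that the files carry the group testuser rather than yarn.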
> Digging even further revealed that the group-sticky (setgid) bit, which was faithfully set
> on all the subdirectories between testuser and application_1327102703969_0001, was gone
> from output and attempt_1327102703969_0001_m_000001_0.
> Looking into how these subdirectories are created (org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.initDirs()):
> {noformat}
>       // $x/usercache/$user/appcache/$appId/filecache
>       Path appFileCacheDir = new Path(appBase, FILECACHE);
>       appsFileCacheDirs[i] = appFileCacheDir.toString();
>       lfs.mkdir(appFileCacheDir, null, false);
>       // $x/usercache/$user/appcache/$appId/output
>       lfs.mkdir(new Path(appBase, OUTPUTDIR), null, false);
> {noformat}
> This reveals that lfs.mkdir ends up manipulating permissions and thus clears the sticky
> bit from output and filecache.
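The effect described here can be illustrated with a small model of Linux setgid-directory semantics: a directory created under a setgid parent inherits the parent's group and the setgid bit, while a later explicit permission change that omits the bit clears it, so anything created below falls back to the creator's primary group. This is only a sketch under those assumptions; the class, the `Dir` type, and the concrete modes (02750, 0750) are hypothetical and do not reproduce what lfs.mkdir actually does internally.

```java
// Toy model of setgid inheritance on directories (hypothetical names, not Hadoop code).
final class SetgidModel {
    static final int S_ISGID = 02000; // set-group-ID bit

    static final class Dir {
        final String group;
        final int mode;
        Dir(String group, int mode) {
            this.group = group;
            this.mode = mode;
        }
    }

    // mkdir under `parent` by a process whose primary group is `creatorGroup`.
    static Dir mkdir(Dir parent, String creatorGroup, int requestedMode) {
        boolean inherit = (parent.mode & S_ISGID) != 0;
        String group = inherit ? parent.group : creatorGroup;
        int mode = inherit ? (requestedMode | S_ISGID) : requestedMode;
        return new Dir(group, mode);
    }

    // Explicit permission change: the new mode replaces the old one entirely.
    static Dir chmod(Dir dir, int newMode) {
        return new Dir(dir.group, newMode);
    }

    public static void main(String[] args) {
        Dir appDir = new Dir("yarn", 02750);            // setgid still present here
        Dir output = mkdir(appDir, "testuser", 0750);   // inherits group yarn and setgid
        System.out.println(output.group + " " + Integer.toOctalString(output.mode));   // yarn 2750

        Dir reset = chmod(output, 0750);                // permissions rewritten without setgid
        Dir attempt = mkdir(reset, "testuser", 0750);   // no inheritance below this point
        System.out.println(attempt.group + " " + Integer.toOctalString(attempt.mode)); // testuser 750
    }
}
```

In this model the attempt directory ends up with group testuser, which matches the ownership shown in the ls output earlier in the report.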
> At this point I'm at a loss about how this is supposed to work. My understanding was
> that the whole sequence of events here was predicated on the sticky bit being set, so
> that daemons running under the user yarn (default group yarn) can have access
> to the resulting files and subdirectories down at output and below. Please let
> me know if I'm missing something or whether this is just a bug that needs to be fixed.
> On a related note, when the shuffle side of the Pi job failed, the job itself didn't.
> It went into an endless loop and only exited when it exhausted all the local storage
> for the log files (at which point the nodemanager died and thus the job ended). Perhaps
> this is an even more serious side effect of this issue that needs to be investigated
> separately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
