Date: Thu, 9 Feb 2012 20:04:23 +0000 (UTC)
From: "Roman Shaposhnik (Commented) (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Message-ID: <1467691317.20869.1328817863521.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <125860674.77708.1327520980286.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (MAPREDUCE-3728) ShuffleHandler can't access results when configured in a secure mode

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204799#comment-13204799 ]

Roman Shaposhnik commented on MAPREDUCE-3728:
---------------------------------------------

Here's a more direct way to reproduce the problem.

{noformat}
# su - yarn
yarn$ mkdir -p /tmp/TEST/{logs,locs} /tmp/TEST/locs/usercache
yarn$ cp /tmp/cont1.tokens /tmp/TEST/cont1.tokens
yarn$ container-executor rvs 0 app1 /tmp/TEST/cont1.tokens /tmp/TEST/locs /tmp/TEST/logs /usr/java/jdk1.6.0_26/jre/bin/java -classpath /usr/lib/hadoop/lib/\*:/usr/lib/hadoop/\*:/etc/hadoop/conf/nm-config/log4j.properties:/etc/hadoop/conf -Djava.library.path=/usr/lib/hadoop/lib/native org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer rvs app1 cont1 0.0.0.0 4344 /tmp/TEST/locs
main : command provided 0
main : user is rvs
12/02/09 11:54:40 WARN conf.Configuration: mapred-site.xml:an attempt to override final parameter: mapreduce.cluster.local.dir; Ignoring.
12/02/09 11:54:40 WARN conf.Configuration: mapred-site.xml:an attempt to override final parameter: mapreduce.cluster.local.dir; Ignoring.
12/02/09 11:54:40 WARN conf.Configuration: mapred-site.xml:an attempt to override final parameter: mapreduce.cluster.local.dir; Ignoring.
12/02/09 11:54:40 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/02/09 11:54:40 WARN conf.Configuration: mapred-site.xml:an attempt to override final parameter: mapreduce.cluster.local.dir; Ignoring.
12/02/09 11:54:40 WARN conf.Configuration: mapred-site.xml:an attempt to override final parameter: mapreduce.cluster.local.dir; Ignoring.
12/02/09 11:54:40 WARN conf.Configuration: mapred-site.xml:an attempt to override final parameter: mapreduce.cluster.local.dir; Ignoring.
=========== Using localizerTokenSecurityInfo
12/02/09 11:54:41 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:4344. Already tried 0 time(s).
12/02/09 11:54:42 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:4344. Already tried 1 time(s).
12/02/09 11:54:43 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:4344. Already tried 2 time(s).
12/02/09 11:54:44 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:4344. Already tried 3 time(s).
12/02/09 11:54:45 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:4344. Already tried 4 time(s).
12/02/09 11:54:46 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:4344. Already tried 5 time(s).
12/02/09 11:54:47 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:4344. Already tried 6 time(s).
12/02/09 11:54:48 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:4344. Already tried 7 time(s).
12/02/09 11:54:49 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:4344. Already tried 8 time(s).
12/02/09 11:54:50 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:4344. Already tried 9 time(s).
java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:62)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:221)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.main(ContainerLocalizer.java:345)
Caused by: com.google.protobuf.ServiceException: java.net.ConnectException: Call From c0506.hal.cloudera.com/172.29.81.158 to 0.0.0.0:4344 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:148)
    at $Proxy6.heartbeat(Unknown Source)
    at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:54)
    ... 3 more
Caused by: java.net.ConnectException: Call From c0506.hal.cloudera.com/172.29.81.158 to 0.0.0.0:4344 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:686)
    at org.apache.hadoop.ipc.Client.call(Client.java:1141)
    at org.apache.hadoop.ipc.Client.call(Client.java:1100)
    at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:145)
    ... 5 more
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:488)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:469)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:563)
    at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:211)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1247)
    at org.apache.hadoop.ipc.Client.call(Client.java:1117)
    ... 7 more
{noformat}

As you can see, the localization process got to the point where it was trying to fetch the files for localization, which means it had already completed all of the filesystem manipulations. The next step would have been launching a container under the user id 'rvs'. So let's see where that container would have put its intermediate results:

{noformat}
yarn$ ls -ld /tmp/TEST/locs/usercache/rvs/appcache/app1/output/
drwxr-xr-x 2 rvs yarn 4096 Feb  9 11:54 /tmp/TEST/locs/usercache/rvs/appcache/app1/output/
{noformat}

Quite naturally, given that the sticky bit is no longer present on the output dir AND that user rvs has rvs as its default group, the resulting files end up totally out of reach for anything running under the yarn account.
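To see why that bit matters, here is a minimal shell illustration of the group-inheritance behavior it provides. This is a sketch, not output from the cluster above: /tmp/demo is an arbitrary scratch directory, rvs is assumed to be a user whose default group is rvs, yarn is assumed to be an existing group, and the ls output is abridged:

{noformat}
# mkdir /tmp/demo && chown rvs:yarn /tmp/demo && chmod 2770 /tmp/demo
# su rvs -c 'umask 027; mkdir /tmp/demo/d1; touch /tmp/demo/d1/f1'
# ls -l /tmp/demo/d1/f1
-rw-r----- 1 rvs yarn ... /tmp/demo/d1/f1   # group yarn inherited from the setgid parent: yarn can read it
# chmod g-s /tmp/demo
# su rvs -c 'umask 027; mkdir /tmp/demo/d2; touch /tmp/demo/d2/f2'
# ls -l /tmp/demo/d2/f2
-rw-r----- 1 rvs rvs ... /tmp/demo/d2/f2    # default group rvs: yarn is locked out
{noformat}

With the setgid ("group-sticky") bit on the parent directory, everything created underneath lands in group yarn and stays group-readable; the moment the bit disappears, new files fall back to the creator's default group, which is exactly the situation in the listing above.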
> ShuffleHandler can't access results when configured in a secure mode
> --------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3728
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3728
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.0
>            Reporter: Roman Shaposhnik
>            Priority: Critical
>             Fix For: 0.23.1
>
>
> While running the simplest of jobs (Pi) on MR2 in a fully secure configuration, I noticed that the job was failing on the reduce side with the following messages littering the nodemanager logs:
> {noformat}
> 2012-01-19 08:35:32,544 ERROR org.apache.hadoop.mapred.ShuffleHandler: Shuffle error
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find usercache/rvs/appcache/application_1326928483038_0001/output/attempt_1326928483038_0001_m_000003_0/file.out.index in any of the configured local directories
> {noformat}
> While digging further I found out that the permissions on the files/dirs were prohibiting the nodemanager (running under the user yarn) from accessing these files:
> {noformat}
> $ ls -l /data/3/yarn/usercache/testuser/appcache/application_1327102703969_0001/output/attempt_1327102703969_0001_m_000001_0
> -rw-r----- 1 testuser testuser 28 Jan 20 15:41 file.out
> -rw-r----- 1 testuser testuser 32 Jan 20 15:41 file.out.index
> {noformat}
> Digging even further revealed that the group-sticky bit that had been faithfully put on all the subdirectories between testuser and application_1327102703969_0001 was gone from output and attempt_1327102703969_0001_m_000001_0.
> Looking into how these subdirectories are created (org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.initDirs()):
> {noformat}
> // $x/usercache/$user/appcache/$appId/filecache
> Path appFileCacheDir = new Path(appBase, FILECACHE);
> appsFileCacheDirs[i] = appFileCacheDir.toString();
> lfs.mkdir(appFileCacheDir, null, false);
> // $x/usercache/$user/appcache/$appId/output
> lfs.mkdir(new Path(appBase, OUTPUTDIR), null, false);
> {noformat}
> reveals that lfs.mkdir ends up manipulating permissions and thus clears the sticky bit from output and filecache.
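The clearing effect described in the quoted report is easy to demonstrate with plain shell tools as well. Continuing the hypothetical /tmp/demo layout from the earlier sketch: on Linux a subdirectory of a setgid directory inherits the bit, but an explicit permission set that omits the bit (which appears to be effectively what lfs.mkdir's permission handling does here) wipes it:

{noformat}
# ls -ld /tmp/demo/d1
drwxr-s--- 2 rvs yarn ... /tmp/demo/d1    # setgid was inherited from the parent
# chmod g=rx /tmp/demo/d1                 # group bits set explicitly, without 's'
# ls -ld /tmp/demo/d1
drwxr-x--- 2 rvs yarn ... /tmp/demo/d1    # setgid gone: new files here get the creator's default group
{noformat}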
> At this point I'm at a loss about how this is supposed to work. My understanding was that the whole sequence of events here was predicated on the sticky bit being set, so that daemons running under the user yarn (default group yarn) could have access to the resulting files and subdirectories down at output and below. Please let me know if I'm missing something or whether this is just a bug that needs to be fixed.
> On a related note, when the shuffle side of the Pi job failed, the job itself didn't. It went into an endless loop and only exited when it had exhausted all the local storage for the log files (at which point the nodemanager died and thus the job ended). Perhaps this is an even more serious side effect of this issue that needs to be investigated separately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira