hadoop-yarn-issues mailing list archives

From "Wangda Tan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2795) Resource Manager fails startup with HDFS label storage and secure cluster
Date Mon, 03 Nov 2014 21:15:34 GMT

    [ https://issues.apache.org/jira/browse/YARN-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195099#comment-14195099 ]

Wangda Tan commented on YARN-2795:
----------------------------------

Just tried this in a security-enabled cluster. Without this patch, the RM fails to start
because we don't log in before accessing HDFS. With this patch, the RM starts successfully
with labels stored on HDFS, and an MR job submitted after startup also completed successfully.
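
For reference, the shape of such a fix can be sketched as follows (a rough sketch only, not
the actual patch; the exact call site may differ, and {{nodeLabelsManager}} is an illustrative
name):

{code}
// Sketch: log in from the RM keytab BEFORE the label store touches HDFS,
// so that the mkdirs() on yarn.node-labels.fs-store.root-dir runs with
// valid Kerberos credentials.
SecurityUtil.login(conf, YarnConfiguration.RM_KEYTAB,
    YarnConfiguration.RM_PRINCIPAL);   // primes the credential cache
nodeLabelsManager.init(conf);          // FileSystemNodeLabelsStore can now mkdir
{code}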

> Resource Manager fails startup with HDFS label storage and secure cluster
> -------------------------------------------------------------------------
>
>                 Key: YARN-2795
>                 URL: https://issues.apache.org/jira/browse/YARN-2795
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Phil D'Amore
>            Assignee: Wangda Tan
>         Attachments: YARN-2795-20141101-1.patch, YARN-2795-20141102-1.patch, YARN-2795-20141102-2.patch
>
>
> When node labels are in use, and yarn.node-labels.fs-store.root-dir is set to an hdfs:// path, and the cluster is using kerberos, the RM fails to start while trying to unmarshal the label store.  The following error/stack trace is observed:
> {code}
> 2014-10-31 11:55:53,807 INFO  service.AbstractService (AbstractService.java:noteFailure(272)) - Service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in state INITED; cause: java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "host.running.rm/10.0.0.34"; destination host is: "host.running.nn":8020;
> java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "host.running.rm/10.0.0.34"; destination host is: "host.running.nn":8020;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1472)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>         at com.sun.proxy.$Proxy14.mkdirs(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:539)
>         at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>         at com.sun.proxy.$Proxy15.mkdirs(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2731)
>         at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2702)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:870)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:866)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:866)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:859)
>         at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1817)
>         at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.init(FileSystemNodeLabelsStore.java:87)
>         at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:206)
>         at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceInit(CommonNodeLabelsManager.java:199)
>         at org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.serviceInit(RMNodeLabelsManager.java:62)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:547)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:986)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:245)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1216)
> Caused by: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
>         at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:680)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>         at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:643)
>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:730)
>         at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1438)
>         ... 30 more
> {code}
> I think this is a startup ordering issue: the scheduler is initialized before the RM primes the credential cache.  My reasoning is based on what happens when I don't set the yarn.node-labels.fs-store.root-dir property, so that no HDFS interaction happens when the scheduler initializes.  Here is the relevant snippet from the log:
> {code}
> 2014-10-31 12:04:09,739 INFO  capacity.CapacityScheduler (CapacityScheduler.java:parseQueue(602)) - Initialized queue: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=0, numContainers=0
> 2014-10-31 12:04:09,739 INFO  capacity.CapacityScheduler (CapacityScheduler.java:parseQueue(602)) - Initialized queue: root: numChildQueue= 1, capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>usedCapacity=0.0, numApps=0, numContainers=0
> 2014-10-31 12:04:09,742 INFO  capacity.CapacityScheduler (CapacityScheduler.java:initializeQueues(466)) - Initialized root queue root: numChildQueue= 1, capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>usedCapacity=0.0, numApps=0, numContainers=0
> 2014-10-31 12:04:09,742 INFO  capacity.CapacityScheduler (CapacityScheduler.java:initializeQueueMappings(435)) - Initialized queue mappings, override: false
> 2014-10-31 12:04:09,742 INFO  capacity.CapacityScheduler (CapacityScheduler.java:initScheduler(304)) - Initialized CapacityScheduler with calculator=class org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator, minimumAllocation=<<memory:256, vCores:1>>, maximumAllocation=<<memory:2048, vCores:32>>, asynchronousScheduling=false, asyncScheduleInterval=5ms
> 2014-10-31 12:04:09,866 INFO  security.UserGroupInformation (UserGroupInformation.java:loginUserFromKeytab(938)) - Login successful for user rm/host.running.rm@SLIDER1.EXAMPLE.COM using keytab file /etc/security/keytabs/rm.service.keytab
> {code}
> You can see the scheduler initializes, and only then does the cred cache get primed.  This results in a successful RM start, but of course my HDFS-backed labels are now not loaded.
> I think that if the cred cache were initialized before the scheduler, this error would not happen.
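
The ordering the reporter describes can be modeled with a tiny pure-JDK toy (no Hadoop dependencies; the class and method names below are illustrative stand-ins, not Hadoop APIs):

```java
// Toy model of the startup-ordering bug: a component that needs Kerberos
// credentials fails when it is initialized before the login step that
// primes the credential cache.
import java.util.ArrayList;
import java.util.List;

public class InitOrderDemo {
    private boolean loggedIn = false;        // stands in for the ticket cache
    final List<String> events = new ArrayList<>();

    void loginFromKeytab() {                 // stands in for the keytab login
        loggedIn = true;
        events.add("login");
    }

    void initLabelStore() {                  // stands in for the HDFS-backed store init
        if (!loggedIn) {
            throw new IllegalStateException("GSS initiate failed: no valid credentials");
        }
        events.add("labelStore");
    }

    public static void main(String[] args) {
        // Buggy order: init the HDFS-backed store before login -> startup fails.
        InitOrderDemo buggy = new InitOrderDemo();
        try {
            buggy.initLabelStore();
            throw new AssertionError("should have failed without login");
        } catch (IllegalStateException expected) {
            System.out.println("without login: " + expected.getMessage());
        }

        // Fixed order: prime the credential cache first, then init the store.
        InitOrderDemo fixed = new InitOrderDemo();
        fixed.loginFromKeytab();
        fixed.initLabelStore();
        System.out.println("with login: " + String.join(" -> ", fixed.events));
    }
}
```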



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
