hadoop-yarn-issues mailing list archives

From "Aleksandr Balitsky (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-5691) RM failed Failed to load/recover state due to bad DelegationKey in RM State Store
Date Thu, 29 Sep 2016 14:00:31 GMT

     [ https://issues.apache.org/jira/browse/YARN-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aleksandr Balitsky updated YARN-5691:
-------------------------------------
    Attachment: YARN_5691_v1_001_patch.patch

> RM failed Failed to load/recover state due to bad DelegationKey in RM State Store
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-5691
>                 URL: https://issues.apache.org/jira/browse/YARN-5691
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.0, 2.7.1, 2.7.2, 2.7.3
>            Reporter: Aleksandr Balitsky
>            Priority: Minor
>         Attachments: YARN_5691_v1_001_patch.patch
>
>
> The RM failed during recovery with the following error:
> 2016-09-12 21:32:21,999 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
Failed to load/recover state
> java.io.EOFException
>         at java.io.DataInputStream.readByte(DataInputStream.java:267)
>         at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
>         at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
>         at org.apache.hadoop.security.token.delegation.DelegationKey.readFields(DelegationKey.java:110)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.loadRMDTSecretManagerState(FileSystemRMStateStore.java:346)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.loadState(FileSystemRMStateStore.java:199)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:587)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1007)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1048)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1044)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1044)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1084)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1221)
> 2016-09-12 21:32:22,002 INFO org.apache.hadoop.service.AbstractService: Service RMActiveServices
failed in state STARTED; cause: java.io.EOFException
> java.io.EOFException
>         at java.io.DataInputStream.readByte(DataInputStream.java:267)
>         at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
>         at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
>         at org.apache.hadoop.security.token.delegation.DelegationKey.readFields(DelegationKey.java:110)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.loadRMDTSecretManagerState(FileSystemRMStateStore.java:346)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.loadState(FileSystemRMStateStore.java:199)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:587)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1007)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1048)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1044)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1044)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1084)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1221)
> 2016-09-12 21:32:22,008 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping
ResourceManager metrics system...
> 2016-09-12 21:32:22,009 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager
metrics system stopped.
> 2016-09-12 21:32:22,009 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager
metrics system shutdown complete.
> 2016-09-12 21:32:22,010 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher
is draining to stop, igonring any new events.
> 2016-09-12 21:32:22,012 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager
failed in state STOPPED; cause: java.lang.NullPointerException
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.stopDispatcher(CommonNodeLabelsManager.java:250)
>         at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:256)
>         at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>         at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
>         at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
>         at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
>         at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:614)
>         at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>         at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
>         at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1007)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1048)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1044)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
> This happens because the DelegationKey_45 file has size 0. You can easily reproduce it
> by placing an empty DelegationKey_45 file under the
> /var/user/cluster/yarn/rm/system/FSRMStateRoot/RMDTSecretManagerRoot/
> directory in HDFS and then restarting the RM.
> The solution is to add a check for an empty stream of DelegationKey data, so that the RM
> does not fail during startup.
> Additionally, the "storeRMDTMasterKeyState" method in ZKRMStateStore.java stores the
> DelegationKey file (the file that was broken (empty) in our case). This method can leave
> the DelegationKey file empty if the write method of DataOutputStream fails. A JIRA that
> prevents a possible resource leak in this method has already been fixed:
> https://issues.apache.org/jira/browse/YARN-5663
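
The empty-stream check proposed in the description could look roughly like the sketch below. It is not the attached patch and does not use Hadoop classes: EmptyKeyFileCheck, Key, loadKey and encode are hypothetical names, and the two-field binary layout (an int followed by a long) is an assumption made only for illustration, not DelegationKey's real Writable encoding. The point is just that a zero-length key file is skipped with a warning instead of letting an EOFException abort recovery.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Optional;

class EmptyKeyFileCheck {

    // Hypothetical stand-in for DelegationKey; the real class is serialized
    // with Hadoop's Writable encoding, which is not reproduced here.
    static final class Key {
        final int keyId;
        final long expiryDate;

        Key(int keyId, long expiryDate) {
            this.keyId = keyId;
            this.expiryDate = expiryDate;
        }
    }

    // Returns the decoded key, or Optional.empty() when the file has no data.
    // Non-empty but truncated data is still reported as an error.
    static Optional<Key> loadKey(String fileName, byte[] data) {
        if (data == null || data.length == 0) {
            // The guard: an empty file is skipped instead of being parsed.
            System.err.println("Skipping empty key file: " + fileName);
            return Optional.empty();
        }
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return Optional.of(new Key(in.readInt(), in.readLong()));
        } catch (IOException e) {
            throw new UncheckedIOException("Corrupt key file: " + fileName, e);
        }
    }

    // Helper that produces bytes in the assumed layout, used only by the demo.
    static byte[] encode(int keyId, long expiryDate) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(keyId);
            out.writeLong(expiryDate);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // A zero-length file, as in the reported DelegationKey_45 case.
        System.out.println(loadKey("DelegationKey_45", new byte[0]).isPresent());
        // A well-formed file in the assumed layout.
        System.out.println(loadKey("DelegationKey_46", encode(46, 1000L)).isPresent());
    }
}
```

Whether an empty key file should be silently skipped, logged, or deleted during recovery is a design decision for the actual fix; the sketch only logs and continues.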



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org
