hadoop-mapreduce-issues mailing list archives

From "Aleksandr Balitsky (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (MAPREDUCE-6834) MR application fails with "No NMToken sent" exception after MRAppMaster recovery
Date Thu, 02 Mar 2017 15:20:45 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892400#comment-15892400 ]

Aleksandr Balitsky edited comment on MAPREDUCE-6834 at 3/2/17 3:20 PM:
-----------------------------------------------------------------------

Hi [~haibochen], [~jlowe],
Sorry for the late reply.

{quote}
Is this a scenario where somehow the MRAppMaster is asking to preserve containers across app
attempts? I ask because ApplicationMasterService normally does not call setNMTokensFromPreviousAttempts
on RegisterApplicationMasterResponse unless getKeepContainersAcrossApplicationAttempts on
the application submission context is true. Last I checked the MapReduce client (YARNRunner)
wasn't specifying that when the application is submitted to YARN.
{quote}

Actually, you are right. I did not consider that MR doesn't support AM work-preserving restart,
and I now see that my first patch isn't a good solution for this problem. Thanks for the
review!
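
For reference, the flag in question lives on the application submission context. A hypothetical snippet of what opting in would look like (YARNRunner does not do this today; simplified, not actual MR code):
{code:java}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.util.Records;

// Hypothetical opt-in snippet, not actual YARNRunner code: the RM only populates
// setNMTokensFromPreviousAttempts on the RegisterApplicationMasterResponse when
// the submitter asked to keep containers across application attempts.
ApplicationSubmissionContext appContext =
    Records.newRecord(ApplicationSubmissionContext.class);
appContext.setKeepContainersAcrossApplicationAttempts(true);
{code}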

{quote}
Aleksandr Balitsky, which scheduler were you running?
{quote}

I'm running the Fair Scheduler. I don't think this issue depends on the scheduler, but I will
check it with other schedulers.

{quote}
We have not made changes to preserve containers in MR. Chasing the code in more details, I
came to a similar conclusion as https://issues.apache.org/jira/browse/YARN-3112?focusedCommentId=14299003&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14299003
MR relies on YARN RM to get the NMtokens needed to launch containers with NMs. Given the code
today, it is possible that a null NMToken is sent to MR, which contradicts the javadoc
in SchedulerApplicationAttempt.java here
{quote}

I totally agree with you that we have not made changes to preserve containers in MR. But the
solution that you mentioned contradicts the YARN design:
{quote}
As for network optimization, NMTokens are not sent to the ApplicationMasters for each and
every allocated container, but only for the first time or if NMTokens have to be invalidated
due to the rollover of the underlying master key
{quote}

It is true that a null NMToken can be sent to MR: NMTokens are sent only when they are first
created, which is by design, and the AM then stores them in its NMTokenCache. It is not
necessary to pass NM tokens on every allocation interaction. So clearing the
NMTokenSecretManager cache on every allocation is not the best decision, because it disables
the caching and new NM tokens will be generated (instead of reusing the cached instances) for
every allocation response. IMHO, we shouldn't do this, because it doesn't fix the root cause;
it is only a workaround.
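
To illustrate the caching the design relies on, here is a rough sketch (assumed and simplified, not MR's actual code) of an AM storing the NMTokens it receives the first time into the shared NMTokenCache:
{code:java}
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.NMToken;
import org.apache.hadoop.yarn.client.api.NMTokenCache;

// Rough sketch, not MR's actual code: the RM only ships an NMToken the first
// time (or after a master-key rollover), so the AM caches it for all later
// container launches on that node instead of expecting it again.
void cacheNMTokens(AllocateResponse response) {
  for (NMToken token : response.getNMTokens()) {
    NMTokenCache.setNMToken(token.getNodeId().toString(), token.getToken());
  }
}
{code}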



> MR application fails with "No NMToken sent" exception after MRAppMaster recovery
> --------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6834
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6834
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: resourcemanager, yarn
>    Affects Versions: 2.7.0
>         Environment: Centos 7
>            Reporter: Aleksandr Balitsky
>            Assignee: Aleksandr Balitsky
>            Priority: Critical
>         Attachments: YARN-6019.001.patch
>
>
> *Steps to reproduce:*
> 1) Submit an MR application (for example, the Pi app with 50 containers)
> 2) Find the MRAppMaster process id for the application
> 3) Kill the MRAppMaster with the kill -9 command
> *Expected:* The ResourceManager launches a new MRAppMaster container and MRAppAttempt, and the
> application finishes correctly.
> *Actual:* After launching the new MRAppMaster and MRAppAttempt, the application fails with the
> following exception:
> {noformat}
> 2016-12-22 23:17:53,929 ERROR [ContainerLauncher #9] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl:
Container launch failed for container_1482408247195_0002_02_000011 : org.apache.hadoop.security.token.SecretManager$InvalidToken:
No NMToken sent for node1:43037
> 	at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:254)
> 	at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.<init>(ContainerManagementProtocolProxy.java:244)
> 	at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:129)
> 	at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:395)
> 	at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138)
> 	at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:361)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}
> *Problem*:
> When RMCommunicator sends the "registerApplicationMaster" request to the RM, the RM generates
> NMTokens for the new RMAppAttempt. Those new NMTokens are transmitted to RMCommunicator in the
> RegisterApplicationMasterResponse (getNMTokensFromPreviousAttempts method), but we don't handle
> these tokens in the RMCommunicator.register method. The RM doesn't transmit these tokens again
> in later allocate responses, so they never reach the NMTokenCache and we get the "No NMToken
> sent for node" exception (a sketch of handling them follows below).
> I have found that this issue appears after the changes from https://github.com/apache/hadoop/commit/9b272ccae78918e7d756d84920a9322187d61eed

> I tried the same scenario without that commit, and the application completed successfully
> after MRAppMaster recovery.
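
As a hypothetical illustration of the missing handling described above (method name and wiring assumed for the sketch, not the actual patch), the registration-time tokens could be pushed into the NMTokenCache like this:
{code:java}
import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
import org.apache.hadoop.yarn.api.records.NMToken;
import org.apache.hadoop.yarn.client.api.NMTokenCache;

// Hypothetical sketch, not the actual fix: after a recovered MRAppMaster
// re-registers, store the NMTokens the RM generated for the new attempt so that
// later container launches don't fail with "No NMToken sent for node".
void handleNMTokensFromPreviousAttempts(RegisterApplicationMasterResponse response) {
  for (NMToken token : response.getNMTokensFromPreviousAttempts()) {
    NMTokenCache.setNMToken(token.getNodeId().toString(), token.getToken());
  }
}
{code}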



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


