hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4354) Public resource localization fails with NPE
Date Fri, 13 Nov 2015 18:35:11 GMT

    [ https://issues.apache.org/jira/browse/YARN-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004462#comment-15004462 ]

Jason Lowe commented on YARN-4354:

Looks like this can cause nodemanagers to crash as well:
2015-11-13 17:22:51,063 [AsyncDispatcher event handler] FATAL event.AsyncDispatcher: Error in dispatcher thread
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.getPathForLocalization(LocalResourcesTrackerImpl.java:448)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.addResource(ResourceLocalizationService.java:802)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:704)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:646)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
        at java.lang.Thread.run(Thread.java:745)

I think it was trying to look up a resource that it assumed was still there but that had already been removed.

bq. I think a check for resource visibility should suffice. What do you think ?

What worries me about that approach is the case where a heartbeat from a localizer comes
in just after we cleaned up a resource because a container happened to be released. If the
localization completes just after we removed the resource, we get the same kind of badness.
We may still want a null check just in case we get a late event for a removed resource.
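
To illustrate the null check suggested above, here is a minimal sketch. It is not the NodeManager code: the class TrackerSketch, its methods, and the use of plain strings for resource keys and paths are all hypothetical stand-ins chosen for illustration. It only shows the idea of dropping a late event for a resource that has already been removed instead of dereferencing a missing entry.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a resource tracker; NOT the Hadoop class. */
class TrackerSketch {
    // Maps a resource key to its local path; an entry disappears on cleanup.
    private final Map<String, String> localPaths = new ConcurrentHashMap<>();

    void add(String key, String path) {
        localPaths.put(key, path);
    }

    void remove(String key) {
        localPaths.remove(key);
    }

    /**
     * Handles a (possibly late) localization event for the given key.
     * Returns the tracked path, or empty if the resource was already removed.
     */
    Optional<String> handleLocalized(String key) {
        String path = localPaths.get(key);
        if (path == null) {
            // Late event for a removed resource: ignore it rather than
            // dereference null and take down the dispatcher thread.
            return Optional.empty();
        }
        return Optional.of(path);
    }
}
```

With a guard like this, an event that races with cleanup becomes a no-op for the caller rather than an uncaught exception in the event handler.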

> Public resource localization fails with NPE
> -------------------------------------------
>                 Key: YARN-4354
>                 URL: https://issues.apache.org/jira/browse/YARN-4354
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.7.2
>            Reporter: Jason Lowe
>            Priority: Blocker
>         Attachments: YARN-4354-unittest.patch
> I saw public localization on nodemanagers get stuck because it was constantly rejecting
> requests to the thread pool executor.

