hadoop-yarn-issues mailing list archives

From "lujie (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (YARN-8649) Similar as YARN-4355:NPE while processing localizer heartbeat
Date Tue, 21 Aug 2018 15:34:00 GMT

     [ https://issues.apache.org/jira/browse/YARN-8649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

lujie reassigned YARN-8649:

      Assignee: lujie
    Attachment: YARN-8649.patch

Hi [~jlowe], [~pradeepambati],[~$iddhe$h]

I have restudied the bug according to the logs.

*The root cause:*
 # When the NM shuts down, it sends KILL_CONTAINER to the container. The log shows this:

2018-08-21 20:11:08,316 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
Container container_1534853453424_0001_01_000001 transitioned from LOCALIZING to KILLING
This causes KillBeforeRunningTransition to execute.
 # KillBeforeRunningTransition calls "container.cleanup()", and "cleanup" sends
"ContainerLocalizationCleanupEvent".
 # ContainerLocalizationCleanupEvent causes ResourceLocalizationService.handleCleanupContainerResources
to execute, and "handleCleanupContainerResources" sends "ResourceReleaseEvent".
 # ResourceReleaseEvent causes LocalResourcesTrackerImpl.handle to execute, and
handle (at line 199 in the source code) calls removeResource:

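if (event.getType() == ResourceEventType.RELEASE) {
    // A non-public resource that is still downloading and has no remaining references
    if (rsrc.getState() == ResourceState.DOWNLOADING &&
        rsrc.getRefCount() <= 0 &&
        rsrc.getRequest().getVisibility() != LocalResourceVisibility.PUBLIC) {
        removeResource(req);
    }
}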

 # In removeResource, it does:

LocalizedResource rsrc = localrsrc.remove(req);

 # When a heartbeat comes in, LocalResourcesTrackerImpl.getPathForLocalization does:

Path localPath = new Path(rPath, req.getPath().getName());
LocalizedResource rsrc = localrsrc.get(req); // rsrc is null: removeResource already removed the entry
NPE happens!
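
The ordering described in the steps above can be reproduced in isolation. The following is only a minimal sketch, not the NodeManager code: TrackerSketch, Resource, the String keys, and the path "/tmp/nm-local" are hypothetical stand-ins for LocalResourcesTrackerImpl, LocalizedResource, LocalResourceRequest, and the real local directory. It assumes that the heartbeat path dereferences the looked-up resource right after the lookup.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class TrackerSketch {
    // Hypothetical stand-in for LocalizedResource.
    static final class Resource {
        private String localPath;
        void setLocalPath(String path) { this.localPath = path; }
        String getLocalPath() { return localPath; }
    }

    // Plays the role of the "localrsrc" map in the tracker.
    private final ConcurrentMap<String, Resource> localrsrc = new ConcurrentHashMap<>();

    // A container requests a resource; the tracker starts tracking it.
    void request(String req) {
        localrsrc.putIfAbsent(req, new Resource());
    }

    // Mirrors the RELEASE path: the entry is removed from the map.
    void release(String req) {
        localrsrc.remove(req);
    }

    // Mirrors the unguarded lookup on the heartbeat path.
    String getPathForLocalization(String req, String dir) {
        Resource rsrc = localrsrc.get(req);   // null once release() has run
        rsrc.setLocalPath(dir + "/" + req);   // throws NullPointerException in that case
        return rsrc.getLocalPath();
    }

    // Replays the ordering: request, cleanup (release), then a late heartbeat.
    static boolean heartbeatAfterReleaseThrows() {
        TrackerSketch tracker = new TrackerSketch();
        tracker.request("job.jar");
        tracker.release("job.jar");
        try {
            tracker.getPathForLocalization("job.jar", "/tmp/nm-local");
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(heartbeatAfterReleaseThrows());
    }
}
```

The sketch is single-threaded on purpose: the failure does not need true concurrency, only the ordering "release first, heartbeat second".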

*Unit test:*

While fixing YARN-4355, the patch added the test "testLocalizerHeartbeatWhenAppCleaningUp"
in class "TestResourceLocalizationService".

That test also sends "ContainerLocalizationCleanupEvent", but it does not
cover the case where a heartbeat arrives at this moment.

In this patch, we change "testLocalizerHeartbeatWhenAppCleaningUp" to cover this situation.
The changed test triggers the bug.



To fix the NPE, we only add a null check; I think that is suitable here.
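
For illustration only, a null check of this kind could look like the sketch below. It is not the content of YARN-8649.patch: GuardedTrackerSketch, Resource, and the String keys are hypothetical stand-ins, and returning null to mean "the resource was already released, skip it" is an assumption about how the caller would handle the case.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class GuardedTrackerSketch {
    // Hypothetical stand-in for LocalizedResource.
    static final class Resource {
        private String localPath;
        void setLocalPath(String path) { this.localPath = path; }
        String getLocalPath() { return localPath; }
    }

    private final ConcurrentMap<String, Resource> localrsrc = new ConcurrentHashMap<>();

    void request(String req) {
        localrsrc.putIfAbsent(req, new Resource());
    }

    void release(String req) {
        localrsrc.remove(req);
    }

    // Guarded lookup: a late heartbeat for a released resource no longer throws.
    String getPathForLocalization(String req, String dir) {
        Resource rsrc = localrsrc.get(req);
        if (rsrc == null) {
            // Assumption: the caller treats null as "nothing to localize".
            return null;
        }
        rsrc.setLocalPath(dir + "/" + req);
        return rsrc.getLocalPath();
    }

    public static void main(String[] args) {
        GuardedTrackerSketch tracker = new GuardedTrackerSketch();
        tracker.request("job.jar");
        System.out.println(tracker.getPathForLocalization("job.jar", "/tmp/nm-local"));
        tracker.release("job.jar");
        System.out.println(tracker.getPathForLocalization("job.jar", "/tmp/nm-local"));
    }
}
```

A caller of such a guarded method has to handle the null result itself, otherwise the NPE only moves one frame up the stack.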

> Similar as YARN-4355:NPE while processing localizer heartbeat
> -------------------------------------------------------------
>                 Key: YARN-8649
>                 URL: https://issues.apache.org/jira/browse/YARN-8649
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.1.1
>            Reporter: lujie
>            Assignee: lujie
>            Priority: Major
>         Attachments: YARN-8649.patch, hadoop-hires-nodemanager-hadoop11.log
> I have noticed that a nodemanager was getting NPEs while tearing down. The reason may be
> similar to YARN-4355, which was reported by Jason Lowe.
