hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1801) NPE in public localizer
Date Wed, 28 May 2014 21:19:03 GMT

    [ https://issues.apache.org/jira/browse/YARN-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011601#comment-14011601 ]

Jason Lowe commented on YARN-1801:
----------------------------------

Strictly speaking, the patch does prevent the NPE.  However, the public localizer is still
effectively doomed if this condition occurs because it returns from the run() method.  That
will shut down the localizer thread, and public local resource requests will stop being processed.
In that sense we've traded an NPE with a traceback for a one-line log message.  I'm not sure
this is an improvement, since at least the traceback is easier to notice in the NM log and
we get a corresponding fatal log when someone goes hunting for what went wrong with the public
localizer.

The real issue is that we need to understand what caused pending.remove(completed) to
return null.  This should never happen, and if it does then it means we have a bug.  Trying
to recover from this condition is patching a symptom rather than a root cause.  The problem
that led to the null request event _might_ have been fixed by YARN-1575, which wasn't present
in 2.2 where the original bug occurred.  It would be interesting to know if this has reoccurred
since 2.3.0.

Assuming this is still a potential issue, we should either find a way to prevent it from ever
occurring or recover in a way that keeps the public localizer working as much as possible.
It'd be great if we could just pull from the queue and receive a structure that has both the
request event and the Future<Path> so we don't have to worry about a Future<Path>
with no associated event.  If we're going to try to recover instead, we'd have to log an error
and try to clean up.  With no associated request event and no path if we got an execution error,
it's going to be particularly difficult to recover properly.
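
To make the "structure that has both" idea concrete, here is a rough sketch of one way it could look.  It is not taken from ResourceLocalizationService: the class, the Request and Outcome types, and the track() and runAll() helpers are all hypothetical, and a String stands in for the localized Path.  The assumption is that each download is wrapped so the completed task itself reports which request it belongs to, which removes the separate pending-map lookup that can come back null.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch; none of these names come from the NodeManager code.
class PairedLocalizerSketch {

    /** Stand-in for the localizer request event. */
    static final class Request {
        final String resource;
        Request(String resource) { this.resource = resource; }
    }

    /** Result that always carries its originating request, on success or failure. */
    static final class Outcome {
        final Request request;
        final String path;        // null when the download failed
        final Exception failure;  // null when the download succeeded
        Outcome(Request request, String path, Exception failure) {
            this.request = request;
            this.path = path;
            this.failure = failure;
        }
    }

    /** Wraps a download so a failure is captured in the Outcome instead of being thrown. */
    static Callable<Outcome> track(Request request, Callable<String> download) {
        return () -> {
            try {
                return new Outcome(request, download.call(), null);
            } catch (Exception e) {
                return new Outcome(request, null, e);
            }
        };
    }

    /** Submits every task and drains the completion queue; each result knows its request. */
    static List<Outcome> runAll(List<Callable<Outcome>> tasks)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            CompletionService<Outcome> queue = new ExecutorCompletionService<>(pool);
            for (Callable<Outcome> task : tasks) {
                queue.submit(task);
            }
            List<Outcome> done = new ArrayList<>();
            for (int i = 0; i < tasks.size(); i++) {
                // get() can still throw if a task dies with an Error, which track() does not catch.
                done.add(queue.take().get());
            }
            return done;
        } finally {
            pool.shutdown();
        }
    }
}
```

With this shape a failed download still arrives with its request attached, so the cleanup path has something to work with.  It does not explain how the original null came about, so it would complement finding the root cause rather than replace it.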

> NPE in public localizer
> -----------------------
>
>                 Key: YARN-1801
>                 URL: https://issues.apache.org/jira/browse/YARN-1801
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Jason Lowe
>            Assignee: Hong Zhiguo
>            Priority: Critical
>         Attachments: YARN-1801.patch
>
>
> While investigating YARN-1800, found this in the NM logs that caused the public localizer to shut down:
> {noformat}
> 2014-01-23 01:26:38,655 INFO  localizer.ResourceLocalizationService (ResourceLocalizationService.java:addResource(651)) - Downloading public rsrc:{ hdfs://colo-2:8020/user/fertrist/oozie-oozi/0000601-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar, 1390440382009, FILE, null }
> 2014-01-23 01:26:38,656 FATAL localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(726)) - Error: Shutting down
> java.lang.NullPointerException
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:712)
> 2014-01-23 01:26:38,656 INFO  localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(728)) - Public cache exiting
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
