hadoop-yarn-issues mailing list archives

From "Varun Saxena (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3793) Several NPEs when deleting local files on NM recovery
Date Thu, 25 Jun 2015 20:21:05 GMT

    [ https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601865#comment-14601865 ]

Varun Saxena commented on YARN-3793:

While the NPEs are a problem, a closer look at the code shows that there is a bigger problem
here: *container logs can be lost* if a disk has gone bad (has become 90% full).

When an application finishes, we upload logs after aggregation by calling {{AppLogAggregatorImpl#uploadLogsForContainers}}.
This call in turn determines the eligible directories by calling {{LocalDirsHandlerService#getLogDirs}},
which returns nothing when the disk is full. So none of the container logs are aggregated
and uploaded.
But on application finish, we also call {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}.
This deletes the application directory, which contains the container logs, because it calls
{{LocalDirsHandlerService#getLogDirsForCleanup}}, which returns the full disks as well.

So we are left with neither aggregated logs for the app nor the individual container logs
for the app.
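
To make the sequence concrete, here is a minimal sketch that only models the behaviour described above; it is not the NodeManager code. {{LogLossSketch}}, {{LogDir}}, {{finishApplication}} and the directory path are hypothetical names introduced for illustration, and the two helper methods merely imitate what {{getLogDirs}} and {{getLogDirsForCleanup}} are described as returning.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class LogLossSketch {
    // Hypothetical stand-in for one NM log directory and the container logs it holds.
    static final class LogDir {
        final String path;
        final boolean full;
        final List<String> containerLogs = new ArrayList<>();

        LogDir(String path, boolean full) {
            this.path = path;
            this.full = full;
        }
    }

    // Imitates the described behaviour of getLogDirs: full disks are excluded.
    static List<LogDir> getLogDirs(List<LogDir> all) {
        List<LogDir> good = new ArrayList<>();
        for (LogDir d : all) {
            if (!d.full) {
                good.add(d);
            }
        }
        return good;
    }

    // Imitates the described behaviour of getLogDirsForCleanup: full disks are included.
    static List<LogDir> getLogDirsForCleanup(List<LogDir> all) {
        return new ArrayList<>(all);
    }

    // Returns {logs uploaded, logs deleted} for one application finish.
    static int[] finishApplication(List<LogDir> all) {
        int uploaded = 0;
        for (LogDir d : getLogDirs(all)) {
            uploaded += d.containerLogs.size(); // "upload" step only sees good dirs
        }
        int deleted = 0;
        for (LogDir d : getLogDirsForCleanup(all)) {
            deleted += d.containerLogs.size();  // cleanup step also sees full dirs
            d.containerLogs.clear();
        }
        return new int[] {uploaded, deleted};
    }

    public static void main(String[] args) {
        LogDir fullDisk = new LogDir("/hypothetical/nm-log-dir", true);
        fullDisk.containerLogs.add("container_01/stdout");
        fullDisk.containerLogs.add("container_01/stderr");

        int[] result = finishApplication(Collections.singletonList(fullDisk));
        System.out.println("uploaded=" + result[0] + ", deleted=" + result[1]);
    }
}
```

With a single full disk, the sketch uploads nothing and deletes both logs, which is the mismatch between the two directory lists described above.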

This sounds like a critical issue, if not a blocker. [~kasha], [~jlowe], can you have a look? I
will upload a patch shortly.

> Several NPEs when deleting local files on NM recovery
> -----------------------------------------------------
>                 Key: YARN-3793
>                 URL: https://issues.apache.org/jira/browse/YARN-3793
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.6.0
>            Reporter: Karthik Kambatla
>            Assignee: Varun Saxena
> When NM work-preserving restart is enabled, we see several NPEs on recovery. These seem
> to correspond to sub-directories that need to be deleted. I wonder if null pointers here mean
> incorrect tracking of these resources and a potential leak. This JIRA is to investigate and
> fix anything required.
> Logs show:
> {noformat}
> 2015-05-18 07:06:10,225 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : null
> 2015-05-18 07:06:10,224 ERROR org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during execution of task in DeletionService
> java.lang.NullPointerException
>         at org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274)
>         at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755)
>         at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458)
>         at org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293)
> {noformat}
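
The trace above shows a null path reaching the delete call. As a rough illustration only, the sketch below skips null entries before anything is deleted. {{NullPathGuardSketch}} and {{deletablePaths}} are hypothetical names, not part of {{DeletionService}}, and the sample paths are made up. A guard like this only avoids the NPE; it does not explain why a null path was recorded in the first place, which is the tracking question raised in the description.

```java
import java.util.ArrayList;
import java.util.List;

public class NullPathGuardSketch {
    // Returns the entries that are safe to hand to a delete call, dropping nulls.
    static List<String> deletablePaths(String[] candidates) {
        List<String> safe = new ArrayList<>();
        if (candidates == null) {
            return safe; // nothing recorded, nothing to delete
        }
        for (String path : candidates) {
            if (path == null) {
                continue; // a null here is what would trigger the NPE downstream
            }
            safe.add(path);
        }
        return safe;
    }

    public static void main(String[] args) {
        String[] recovered = {"/hypothetical/usercache/app_1", null};
        System.out.println(deletablePaths(recovered));
    }
}
```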

This message was sent by Atlassian JIRA
