hadoop-yarn-issues mailing list archives

From "Varun Saxena (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3793) Several NPEs when deleting local files on NM recovery
Date Tue, 23 Jun 2015 20:20:43 GMT

    [ https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598299#comment-14598299 ]

Varun Saxena commented on YARN-3793:
------------------------------------

[~kasha], I think I know what's happening.
When disks go bad (say, because a disk is full), uploading container logs runs into a problem.

In {{AppLogAggregatorImpl#doContainerLogAggregation}} only good log directories are considered
for log aggregation. This leads to {{AggregatedLogFormat#getPendingLogFilesToUploadForThisContainer}}
returning no log files to be uploaded.

The caller of {{doContainerLogAggregation}} is {{AppLogAggregatorImpl#uploadLogsForContainers}},
which, as shown below, calls {{DeletionService#delete}}. If {{uploadedFilePathsInThisCycle}}
is empty *(which it will be if disks are full)*, the deletion task is created with a null sub directory
and no base directories, so it has no valid path to delete. This explains the NPEs being thrown.
When these deletion tasks are stored in the state store, they are stored with nulls as well,
which explains why the NPEs also occur on recovery.
{code}
      boolean uploadedLogsInThisCycle = false;
      for (ContainerId container : pendingContainerInThisCycle) {
        ContainerLogAggregator aggregator = null;
        if (containerLogAggregators.containsKey(container)) {
          aggregator = containerLogAggregators.get(container);
        } else {
          aggregator = new ContainerLogAggregator(container);
          containerLogAggregators.put(container, aggregator);
        }
        Set<Path> uploadedFilePathsInThisCycle =
            aggregator.doContainerLogAggregation(writer, appFinished);
        if (uploadedFilePathsInThisCycle.size() > 0) {
          uploadedLogsInThisCycle = true;
        }
        this.delService.delete(this.userUgi.getShortUserName(), null,
          uploadedFilePathsInThisCycle
            .toArray(new Path[uploadedFilePathsInThisCycle.size()]));
        ......
      }
{code}
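
One way to avoid submitting such tasks is to skip the delete call when nothing was uploaded. The following is only a minimal, self-contained sketch of that guard, not the actual patch: {{RecordingDeletionService}} and {{deleteUploadedFiles}} are hypothetical stand-ins, and {{java.nio.file.Path}} is used in place of Hadoop's {{Path}} so the example runs without Hadoop on the classpath.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class GuardedDeletionSketch {

    /** Hypothetical stand-in for DeletionService; it only records the requests it receives. */
    static class RecordingDeletionService {
        // Number of base directories passed in each delete request.
        final List<Integer> baseDirCounts = new ArrayList<>();

        void delete(String user, Path subDir, Path... baseDirs) {
            baseDirCounts.add(baseDirs.length);
        }
    }

    /**
     * Submits a deletion request only when at least one file was uploaded.
     * Returns true if a request was submitted.
     */
    static boolean deleteUploadedFiles(RecordingDeletionService delService,
                                       String user,
                                       Set<Path> uploadedFilePaths) {
        if (uploadedFilePaths == null || uploadedFilePaths.isEmpty()) {
            // Nothing was uploaded in this cycle, so there is nothing to delete.
            // Skipping the call avoids a task with a null sub directory and no base directories.
            return false;
        }
        delService.delete(user, null, uploadedFilePaths.toArray(new Path[0]));
        return true;
    }

    public static void main(String[] args) {
        RecordingDeletionService delService = new RecordingDeletionService();

        // Simulates the full-disk case: no log files were uploaded in this cycle.
        boolean emptyCase =
            deleteUploadedFiles(delService, "alice", Collections.<Path>emptySet());

        // Simulates the normal case: one log file was uploaded.
        Set<Path> uploaded = new LinkedHashSet<>();
        uploaded.add(Paths.get("/tmp/logs/container_1/stdout"));
        boolean normalCase = deleteUploadedFiles(delService, "alice", uploaded);

        System.out.println("empty set submitted a request: " + emptyCase);
        System.out.println("non-empty set submitted a request: " + normalCase);
        System.out.println("requests recorded: " + delService.baseDirCounts.size());
    }
}
```

A guard like this would only stop the null deletion tasks from being created. It would not address the underlying problem that logs on full disks are never aggregated.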

Log aggregation should consider full disks as well; otherwise there will be nothing to aggregate
when disks are full. Log aggregation leads to deletion of the local logs anyway.

I verified the occurrence of this issue via {{TestLogAggregationService#testLocalFileDeletionAfterUpload}}
by making good log directories return nothing.


> Several NPEs when deleting local files on NM recovery
> -----------------------------------------------------
>
>                 Key: YARN-3793
>                 URL: https://issues.apache.org/jira/browse/YARN-3793
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.6.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>
> When NM work-preserving restart is enabled, we see several NPEs on recovery. These seem
> to correspond to sub-directories that need to be deleted. I wonder if null pointers here mean
> incorrect tracking of these resources and a potential leak. This JIRA is to investigate and
> fix anything required.
> Logs show:
> {noformat}
> 2015-05-18 07:06:10,225 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : null
> 2015-05-18 07:06:10,224 ERROR org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during execution of task in DeletionService
> java.lang.NullPointerException
>         at org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274)
>         at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755)
>         at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458)
>         at org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
