hadoop-mapreduce-issues mailing list archives

From "Xi Fang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-5508) Memory leak caused by unreleased FileSystem objects in JobInProgress#cleanupJob
Date Sat, 14 Sep 2013 00:06:52 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13767210#comment-13767210 ]

Xi Fang commented on MAPREDUCE-5508:

This bug was found in Microsoft's large-scale test with about 200,000 job submissions, during
which memory usage grew steadily.

There was a long discussion between Hortonworks (thanks [~cnauroth] and [~vinodkv]) and Microsoft
on this issue. Here is a summary of the discussion.

1. The heap dumps are showing DistributedFileSystem instances that are only referred to from
the cache's HashMap entries. Since nothing else has a reference, nothing else can ever attempt
to close it, and therefore it will never be removed from the cache. 
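The leak mechanism described above can be sketched in plain Java (class and method names here are illustrative stand-ins, not Hadoop's actual implementation): cache entries are evicted only by an explicit close, so an instance whose sole remaining reference is the cache entry itself can never be closed or removed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the FileSystem cache behaviour: instances are
// handed out from a cache and removed only when a caller explicitly
// closes them. If the caller drops its reference without closing, the
// cache entry stays forever -- the leak seen in the heap dumps.
public class CacheLeakSketch {
    static class CachedFs {
        boolean closed = false;
    }

    static final Map<String, CachedFs> CACHE = new HashMap<>();

    // Analogous to FileSystem.get(): return the cached instance or create one.
    static CachedFs get(String key) {
        return CACHE.computeIfAbsent(key, k -> new CachedFs());
    }

    // Analogous to FileSystem.close(): only an explicit close evicts.
    static void close(String key) {
        CachedFs fs = CACHE.remove(key);
        if (fs != null) {
            fs.closed = true;
        }
    }

    public static void main(String[] args) {
        // Each job submission opens a FileSystem under a distinct cache key
        // (in the real bug, a distinct Subject identity) and never closes it.
        for (int job = 0; job < 1000; job++) {
            get("user-subject-" + job);  // caller's reference dropped, entry retained
        }
        System.out.println(CACHE.size());  // 1000 live entries nothing can close
    }
}
```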

2. The special check for "tempDirFS" (see code in description) in the patch for MAPREDUCE-5351
is intended as an optimization so that CleanupQueue doesn't need to immediately reopen a FileSystem
that was just closed. However, we observed that we're getting different identity hash code
values on the subject in the key. The code is assuming that CleanupQueue will find the same
Subject that was used inside JobInProgress. Unfortunately, this is not guaranteed, because
we may have crossed into a different access control context at this point, via UserGroupInformation#doAs.
Even though it's conceptually the same user, the Subject is a function of the current AccessControlContext:
{code}
  public synchronized
  static UserGroupInformation getCurrentUser() throws IOException {
    AccessControlContext context = AccessController.getContext();
    Subject subject = Subject.getSubject(context);
    ...
  }
{code}
Even if the contexts are logically equivalent between JobInProgress and CleanupQueue, there is
no guarantee that Java will return the same Subject instance, which is required for a successful
lookup in the FileSystem cache (because the cache key uses the Subject's identity hash code).
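The identity-vs-equality distinction can be demonstrated with nothing but the JDK: two Subjects with the same (empty) principal and credential sets compare equal, yet a cache keyed by instance identity, which is effectively how the FileSystem cache behaves via the identity hash code, fails to find one when probed with the other.

```java
import javax.security.auth.Subject;
import java.util.IdentityHashMap;

// Demonstrates why "logically the same user" is not enough: Subject.equals
// compares principals and credentials, but an identity-keyed cache only
// hits when handed the very same instance.
public class SubjectIdentitySketch {
    public static void main(String[] args) {
        // Two Subjects that are logically equivalent but distinct instances.
        Subject s1 = new Subject();
        Subject s2 = new Subject();
        System.out.println(s1.equals(s2));  // true: value equality holds

        // A cache keyed by instance identity (like the FileSystem cache,
        // via the Subject's identity hash code) misses on the second one.
        IdentityHashMap<Subject, String> cache = new IdentityHashMap<>();
        cache.put(s1, "cached FileSystem");
        System.out.println(cache.containsKey(s2));  // false: lookup fails
    }
}
```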

A fix is to abandon this optimization and instead close the FileSystem within the same AccessControlContext
that opened it.
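The shape of that fix can be sketched in self-contained Java (illustrative only, not the committed patch): the code that opened the FileSystem closes it while still holding the same Subject instance, rather than hoping a later, equal-but-distinct Subject will find the cache entry.

```java
import javax.security.auth.Subject;
import java.util.IdentityHashMap;

// Sketch of the proposed fix: open and close under the same Subject
// instance, so the identity-keyed cache lookup is guaranteed to hit.
public class CloseInSameContextSketch {
    static final IdentityHashMap<Subject, String> CACHE = new IdentityHashMap<>();

    static void open(Subject who) {
        CACHE.put(who, "FileSystem for this subject");
    }

    // Returns true if the close actually evicted a cache entry.
    static boolean closeAs(Subject who) {
        return CACHE.remove(who) != null;
    }

    public static void main(String[] args) {
        Subject jobSubject = new Subject();
        open(jobSubject);

        // Broken pattern: a later, logically equivalent but distinct
        // Subject (e.g. obtained in a different AccessControlContext)
        // cannot evict the entry.
        boolean missed = closeAs(new Subject());

        // Fixed pattern: close within the same context that opened.
        boolean hit = closeAs(jobSubject);

        System.out.println(missed + " " + hit);  // false true
    }
}
```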

> Memory leak caused by unreleased FileSystem objects in JobInProgress#cleanupJob
> -------------------------------------------------------------------------------
>                 Key: MAPREDUCE-5508
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5508
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>    Affects Versions: 1-win
>            Reporter: Xi Fang
>            Assignee: Xi Fang
>            Priority: Critical
> MAPREDUCE-5351 fixed a memory leak problem but introduced another FileSystem object that
is not properly released.
> {code} JobInProgress#cleanupJob()
>   void cleanupJob() {
> ...
>           tempDirFs = jobTempDirPath.getFileSystem(conf);
>           CleanupQueue.getInstance().addToQueue(
>               new PathDeletionContext(jobTempDirPath, conf, userUGI, jobId));
> ...
> {code}

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
