hadoop-mapreduce-issues mailing list archives

From "Xi Fang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-5508) JobTracker memory leak caused by unreleased FileSystem objects in JobInProgress#cleanupJob
Date Sat, 14 Sep 2013 07:44:53 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13767401#comment-13767401 ]

Xi Fang commented on MAPREDUCE-5508:

[~sandyr] Thanks for your comments.

bq. Have you tested this fix.

Yes. We have tested this fix on our test cluster (about 130,000 job submissions). After the workflow
was done, we waited a couple of minutes (for jobs to retire), forced a GC, and then
dumped the memory. We manually checked FileSystem#Cache: there was no memory leak.

bq. For your analysis 

1. I agree with "it doesn't appear that tempDirFs and fs are ever even ending up equal because
tempDirFs is created with the wrong UGI."  
2. I think tempDir would be fine because 1) JobInProgress#cleanupJob won't introduce a file
system instance for tempDir and 2) the fs in CleanupQueue#deletePath would be reused (i.e.
only one instance would exist in FileSystem#Cache). My initial thought was that this part had a
memory leak, but a test shows that there is no problem here.
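As a rough illustration of why the reuse in 2) keeps the cache bounded, here is a minimal self-contained sketch in plain JDK Java. The Key class here is a hypothetical stand-in for Hadoop's real cache key (which compares the UGI's Subject by reference), not the actual implementation:

```java
import javax.security.auth.Subject;
import java.util.HashMap;
import java.util.Map;

// Sketch: if every lookup carries the *same* Subject instance, an
// identity-hashed cache key always hits the same entry, so the cache
// stays at one entry no matter how many lookups happen.
public class SingleUgiReuse {

    // Hypothetical stand-in for the file system cache key: identity-based
    // hashing and reference equality on the subject.
    static final class Key {
        final Subject subject;
        Key(Subject s) { this.subject = s; }
        @Override public int hashCode() { return System.identityHashCode(subject); }
        @Override public boolean equals(Object o) {
            return (o instanceof Key) && ((Key) o).subject == subject;
        }
    }

    static int entriesWithSharedSubject(int lookups) {
        Subject shared = new Subject();           // one UGI reused for the job's lifetime
        Map<Key, Object> cache = new HashMap<>();
        for (int i = 0; i < lookups; i++) {
            cache.computeIfAbsent(new Key(shared), k -> new Object()); // always a hit after the first
        }
        return cache.size();
    }

    public static void main(String[] args) {
        System.out.println(entriesWithSharedSubject(10000)); // prints 1
    }
}
```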
3. The problem is actually
tempDirFs = jobTempDirPath.getFileSystem(conf);
This call MAY (I will explain below) put a new entry into FileSystem#Cache.
Note that it eventually goes through UserGroupInformation#getCurrentUser to get a UGI with
the current AccessControlContext.  CleanupQueue#deletePath won't close this entry, because a
different UGI (the "userUGI" created in JobInProgress) is used there. Here is the tricky
part, which [~cnauroth], [~vinodkv], and I discussed at length. The problem is that
although we may have only one current user, the following code MAY return a different Subject on each call.
{code}
static UserGroupInformation getCurrentUser() throws IOException {
  AccessControlContext context = AccessController.getContext();
  Subject subject = Subject.getSubject(context);   // <-- may return a different Subject each time
{code}
Because a FileSystem#Cache entry uses the identityHashCode of the Subject to construct its
key, a file system object created by "jobTempDirPath.getFileSystem(conf)" may not be found
when this code is executed again, even though the principal (i.e. the current user) is the same.
This eventually leads to an unbounded number of file system instances in FileSystem#Cache;
nothing ever removes them from the cache.
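To make the cache-key behavior concrete, here is a minimal, self-contained sketch in plain JDK Java. The Key class is an illustrative, simplified stand-in for Hadoop's cache key (which hashes the UGI's Subject by identity), not the real implementation:

```java
import javax.security.auth.Subject;
import java.util.HashMap;
import java.util.Map;

// Sketch of the leak: each simulated getCurrentUser() returns a fresh
// Subject for the same principal. Because the (hypothetical, simplified)
// key hashes the Subject by identity and compares it by reference,
// every lookup misses and inserts a new entry.
public class CacheLeakDemo {

    static final class Key {
        final String schemeAndAuthority;
        final Subject subject;   // stands in for the UGI's subject
        Key(String schemeAndAuthority, Subject subject) {
            this.schemeAndAuthority = schemeAndAuthority;
            this.subject = subject;
        }
        @Override public int hashCode() {
            return schemeAndAuthority.hashCode() + System.identityHashCode(subject);
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return schemeAndAuthority.equals(k.schemeAndAuthority) && subject == k.subject;
        }
    }

    // Models getCurrentUser() handing back a new Subject each time,
    // even though the logged-in principal never changes.
    static Subject freshSubjectForSameUser() {
        return new Subject();   // equal contents, different identity
    }

    static int entriesAfterLookups(int lookups) {
        Map<Key, Object> cache = new HashMap<>();
        for (int i = 0; i < lookups; i++) {
            Key k = new Key("hdfs://namenode:8020", freshSubjectForSameUser());
            cache.computeIfAbsent(k, x -> new Object()); // never a hit: new entry every time
        }
        return cache.size();
    }

    public static void main(String[] args) {
        System.out.println(entriesAfterLookups(1000)); // prints 1000: one entry per lookup
    }
}
```

Each lookup leaks one entry, matching the unbounded growth described above.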
Please let me know if you have any questions. 
> JobTracker memory leak caused by unreleased FileSystem objects in JobInProgress#cleanupJob
> ------------------------------------------------------------------------------------------
>                 Key: MAPREDUCE-5508
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5508
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>    Affects Versions: 1-win, 1.2.1
>            Reporter: Xi Fang
>            Assignee: Xi Fang
>            Priority: Critical
>         Attachments: MAPREDUCE-5508.patch
> MAPREDUCE-5351 fixed a memory leak problem but introduced another file system object
(see "tempDirFs") that is not properly released.
> {code} JobInProgress#cleanupJob()
>   void cleanupJob() {
> ...
>           tempDirFs = jobTempDirPath.getFileSystem(conf);
>           CleanupQueue.getInstance().addToQueue(
>               new PathDeletionContext(jobTempDirPath, conf, userUGI, jobId));
> ...
>  if (tempDirFs != fs) {
>       try {
>         fs.close();
>       } catch (IOException ie) {
> ...
> }
> {code}

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
