hadoop-hdfs-issues mailing list archives

From "Wei-Chiu Chuang (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HDFS-14492) Snapshot memory leak
Date Wed, 22 May 2019 10:40:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845759#comment-16845759 ]

Wei-Chiu Chuang edited comment on HDFS-14492 at 5/22/19 10:39 AM:
------------------------------------------------------------------

Work in progress: https://github.com/jojochuang/hadoop-common/commit/7b753be2c6a2227300cc612cf08861af7427adef

Without the fix, the fsimage that I mentioned occupies 130.9GB of heap after deleting all snapshots.
After the fix, heap usage drops to 100.6GB after deleting all snapshots.

But I can still see around 10 million FileWithSnapshotFeature and FileDiffList objects lingering in
the heap. If I checkpoint and restart, the NN uses just 87.7GB of heap (all FileWithSnapshotFeature
and FileDiffList objects are gone after the restart). Of course, some runtime state also gets
cleaned up by a restart, but there are still quite a few GBs of heap waiting to be optimized.


> Snapshot memory leak
> --------------------
>
>                 Key: HDFS-14492
>                 URL: https://issues.apache.org/jira/browse/HDFS-14492
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: snapshots
>    Affects Versions: 2.6.0
>         Environment: CDH5.14.4
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>            Priority: Major
>
> We recently examined the NameNode heap dump of a big, heavy snapshot user, trying to
trim some fat, and sure enough, we found a memory leak in it: when snapshots are removed,
the corresponding data structures are not removed.
> This cluster has 586 million file system objects (286 million files, 287 million blocks,
13 million directories), using around 132GB of heap.
> While only 44.5 million files have snapshotted copies (INodeFileAttributes$SnapshotCopy),
most inodes (nearly 212 million) have a FileWithSnapshotFeature and a FileDiffList. Those inodes
had snapshotted copies at some point in the past, but after the snapshots were removed, those data
structures are still kept in the heap.
> INode$Feature = 32.5 bytes on average, FileWithSnapshotFeature = 32 bytes, FileDiffList
= 24 bytes. It may not sound like a lot, but they add up quickly in large clusters like this. In
this cluster, a whopping 13.8GB of memory could have been saved if not for this bug:
(32.5 + 32 + 24) bytes * (211997769 - 44572380) =~ 13.8GB. That is more than 10% of the
heap size.
> Heap histogram for reference:
> {noformat}
> num #instances #bytes class name
>  ----------------------------------------------
>  1: 286418254 27496152384 org.apache.hadoop.hdfs.server.namenode.INodeFile
>  2: 287322227 18388622528 org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo
>  3: 227899550 17144816120 [B
>  4: 287324031 13769408616 [Lorg.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo;
>  5: 71352116 12353841568 [Ljava.lang.Object;
>  6: 286322650 9170335840 [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo;
>  7: 235632329 7658462416 [Lorg.apache.hadoop.hdfs.server.namenode.INode$Feature;
>  8: 4 7046430816 [Lorg.apache.hadoop.util.LightWeightGSet$LinkedElement;
>  9: 211997769 6783928608 org.apache.hadoop.hdfs.server.namenode.snapshot.FileWithSnapshotFeature
>  10: 211997769 5087946456 org.apache.hadoop.hdfs.server.namenode.snapshot.FileDiffList
>  11: 76586261 3780468856 [I
>  12: 44572380 3209211360 org.apache.hadoop.hdfs.server.namenode.INodeFileAttributes$SnapshotCopy
>  13: 58634517 2345380680 java.util.ArrayList
>  14: 44572380 2139474240 org.apache.hadoop.hdfs.server.namenode.snapshot.FileDiff
>  15: 76582416 1837977984 org.apache.hadoop.hdfs.server.namenode.AclFeature
>  16: 12907668 1135874784 org.apache.hadoop.hdfs.server.namenode.INodeDirectory{noformat}
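
The estimate in the description can be reproduced from the histogram rows above. The following is a minimal sketch, not part of the original report: the class and method names are made up for illustration, the per-object sizes are the averages stated in the description, and the "13.8GB" figure is assumed to use 2^30 bytes per GB (with 10^9 bytes per GB the same product is about 14.8GB).

```java
public class SnapshotLeakEstimate {
    // Instance counts copied from the heap histogram in the issue description.
    static final long FILE_WITH_SNAPSHOT_FEATURE_INSTANCES = 211_997_769L;
    static final long SNAPSHOT_COPY_INSTANCES = 44_572_380L;

    // Average per-object sizes in bytes, as stated in the issue description.
    static final double FEATURE_ARRAY_BYTES = 32.5;            // INode$Feature array
    static final double FILE_WITH_SNAPSHOT_FEATURE_BYTES = 32; // FileWithSnapshotFeature
    static final double FILE_DIFF_LIST_BYTES = 24;             // FileDiffList

    /** Inodes that still carry snapshot structures but have no snapshotted copy. */
    static long staleInodes() {
        return FILE_WITH_SNAPSHOT_FEATURE_INSTANCES - SNAPSHOT_COPY_INSTANCES;
    }

    /** Estimated bytes held by the leftover structures. */
    static double leakedBytes() {
        double perInode = FEATURE_ARRAY_BYTES
                + FILE_WITH_SNAPSHOT_FEATURE_BYTES
                + FILE_DIFF_LIST_BYTES;
        return perInode * staleInodes();
    }

    /** Same estimate expressed in units of 2^30 bytes (an assumption about the report's "GB"). */
    static double leakedGiB() {
        return leakedBytes() / (1L << 30);
    }

    public static void main(String[] args) {
        System.out.printf("stale inodes: %d%n", staleInodes());
        System.out.printf("estimated leak: %.1f GiB%n", leakedGiB());
    }
}
```

Against the reported 132GB heap, this estimate is a little over 10%, which matches the claim in the description.
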
> [~szetszwo] [~arpaga] [~smeng] [~shashikant] any thoughts?
> I am thinking that inside AbstractINodeDiffList#deleteSnapshotDiff(), in addition to cleaning
up file diffs, it should also remove FileWithSnapshotFeature. I am not familiar with the snapshot
implementation, so any guidance is greatly appreciated.
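
The idea proposed above can be illustrated with a toy model. This is only a sketch under loud assumptions: ToyFile and ToySnapshotFeature are hypothetical stand-ins, not Hadoop's INodeFile, FileWithSnapshotFeature, or AbstractINodeDiffList, and it is not the work-in-progress patch. Real code would likely need to check conditions that are not modeled here before dropping the feature.

```java
import java.util.ArrayList;
import java.util.List;

public class SnapshotFeatureCleanupSketch {
    /** Hypothetical stand-in for a per-file snapshot feature that owns a diff list. */
    static final class ToySnapshotFeature {
        final List<Integer> diffSnapshotIds = new ArrayList<>();
    }

    /** Hypothetical stand-in for a file inode; null feature means no snapshot state. */
    static final class ToyFile {
        ToySnapshotFeature snapshotFeature;

        /** Records a diff for the given snapshot, creating the feature on first use. */
        void recordDiff(int snapshotId) {
            if (snapshotFeature == null) {
                snapshotFeature = new ToySnapshotFeature();
            }
            snapshotFeature.diffSnapshotIds.add(snapshotId);
        }

        /** Leaky variant: removes the diff but keeps the now-empty feature and list. */
        void deleteSnapshotLeaky(int snapshotId) {
            if (snapshotFeature != null) {
                // Integer.valueOf selects remove(Object), not remove(int index).
                snapshotFeature.diffSnapshotIds.remove(Integer.valueOf(snapshotId));
            }
        }

        /** Proposed variant: also drops the feature once no diffs remain. */
        void deleteSnapshotAndCleanUp(int snapshotId) {
            deleteSnapshotLeaky(snapshotId);
            if (snapshotFeature != null && snapshotFeature.diffSnapshotIds.isEmpty()) {
                snapshotFeature = null; // the empty feature and list become garbage
            }
        }
    }

    public static void main(String[] args) {
        ToyFile leaky = new ToyFile();
        leaky.recordDiff(1);
        leaky.deleteSnapshotLeaky(1);
        System.out.println("leaky keeps feature: " + (leaky.snapshotFeature != null));

        ToyFile cleaned = new ToyFile();
        cleaned.recordDiff(1);
        cleaned.deleteSnapshotAndCleanUp(1);
        System.out.println("cleaned keeps feature: " + (cleaned.snapshotFeature != null));
    }
}
```

In the leaky variant every file that was ever snapshotted keeps two small objects forever, which is the pattern the histogram shows for FileWithSnapshotFeature and FileDiffList.
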



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org