pdfbox-dev mailing list archives

From "Ben Manes (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PDFBOX-4396) Memory leak due to soft reference caching
Date Wed, 05 Dec 2018 17:41:00 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710391#comment-16710391 ]

Ben Manes commented on PDFBOX-4396:

Yes, I agree caching makes sense in general. My case is extreme due to N thousand page documents
from scanned paperwork, taking 3-13 seconds per page for PDFBox to render into an image. While
I'd appreciate better performance, that's only worthwhile if it retains stability.

I agree weak references are not a fit here and did not intend to imply otherwise. My point
is that the cache held 430k HashMap.Entry objects where many might have null values. This
can be pruned by using a ReferenceQueue, something like the below code.

Soft references are problematic and typically chosen because a developer doesn't know a good
size. Instead of a strict limit, the decision is left to the JVM. The references are in a
global cache, so an inexpensive cache might cause a critical one to be flushed. The collection
behavior is GC specific and the penalty is placed in the critical section of the pause time. Many collectors are
not aggressive, which increases hit rates but the memory pressure causes full GCs in short
intervals. A collector that is aggressive makes the cache ineffective.

If there is a way to estimate the size, then a bounded cache is preferable. This avoids the
above problems with the potential of higher hit rates, as LRU can easily be polluted. See
for example [Caffeine's hit rates|https://github.com/ben-manes/caffeine/wiki/Efficiency] by
taking frequency into account, or our new [research paper|https://drive.google.com/file/d/1CT2ASkfuG9qVya9Sn8ZUCZjrFSSyjRA_/view?usp=sharing] for
an adapting policy. If the number of entries or weight of an entry can be estimated then a
strong reference cache is typically the preferred approach. If that is problematic, usually
one has to investigate off-heap caching.
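
To make the bounded, strong-reference option concrete, here is a minimal sketch using only the JDK. The class name WeightBoundedCache and the weigher function are hypothetical and are not part of PDFBox or Caffeine. It evicts in plain LRU order, so it does not take frequency into account the way Caffeine does; it only illustrates bounding by an estimated weight (for example, the byte size of a decoded image) instead of leaving the decision to the GC.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

/** Hypothetical strong-reference cache bounded by total weight (e.g. estimated bytes). */
final class WeightBoundedCache<K, V> {
  // accessOrder=true makes iteration run from least to most recently used
  private final LinkedHashMap<K, V> entries = new LinkedHashMap<>(16, 0.75f, true);
  private final ToLongFunction<V> weigher;
  private final long maxWeight;
  private long totalWeight;

  WeightBoundedCache(long maxWeight, ToLongFunction<V> weigher) {
    this.maxWeight = maxWeight;
    this.weigher = weigher;
  }

  synchronized V get(K key) {
    return entries.get(key);
  }

  synchronized void put(K key, V value) {
    V previous = entries.put(key, value);
    if (previous != null) {
      totalWeight -= weigher.applyAsLong(previous);
    }
    totalWeight += weigher.applyAsLong(value);
    // Evict least recently used entries until the bound holds again
    Iterator<Map.Entry<K, V>> it = entries.entrySet().iterator();
    while (totalWeight > maxWeight && it.hasNext()) {
      Map.Entry<K, V> eldest = it.next();
      totalWeight -= weigher.applyAsLong(eldest.getValue());
      it.remove();
    }
  }

  synchronized long totalWeight() {
    return totalWeight;
  }

  synchronized int size() {
    return entries.size();
  }
}
```

The bound is a hard limit on retained weight, so memory use is predictable regardless of which collector is in use. A value heavier than the limit is evicted immediately rather than retained.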

So far resetting the ResourceCache has been effective. I could try amortizing that, e.g. resetting
it every N pages, to gain a little better reuse as you indicated. If I had a better sense
of the objects being cached, I would switch to a Caffeine-backed version for an explicit
bound. Can the ResourceCache be shared across documents or are the entries document specific?
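
The "reset every N pages" idea could be sketched roughly as follows. PeriodicReset is a hypothetical helper, not a PDFBox class. With PDFBox the reset action would presumably call PDDocument#setResourceCache with a fresh DefaultResourceCache, which is shown only as a comment because it is an assumption about how the helper would be wired up.

```java
/** Hypothetical helper: runs a reset action after every N rendered pages. */
final class PeriodicReset {
  private final int interval;
  private final Runnable reset;
  private int sinceReset;

  PeriodicReset(int interval, Runnable reset) {
    if (interval <= 0) {
      throw new IllegalArgumentException("interval must be positive");
    }
    this.interval = interval;
    this.reset = reset;
  }

  /** Call once after each page has been rendered. */
  void pageRendered() {
    if (++sinceReset >= interval) {
      sinceReset = 0;
      // Assumed wiring with PDFBox:
      //   () -> document.setResourceCache(new DefaultResourceCache())
      reset.run();
    }
  }
}
```

A larger interval allows more reuse of cached resources between resets, at the cost of more memory being retained before the old cache becomes unreachable.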
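
// Sketch of ReferenceQueue-based pruning for a soft-value cache.
// Needs java.lang.ref.{Reference, ReferenceQueue, SoftReference} and java.util.Map.
final ReferenceQueue<V> queue;
final Map<K, SoftValueReference<K, V>> cache;

public void put(K key, V value) {
  prune();
  cache.put(key, new SoftValueReference<>(key, value, queue));
}

public V get(K key) {
  prune();
  var ref = cache.get(key);
  return (ref == null) ? null : ref.get();
}

// Removes map entries whose soft-referenced values have been collected.
private void prune() {
  Reference<? extends V> ref;
  while ((ref = queue.poll()) != null) {
    @SuppressWarnings("unchecked")
    var reference = (SoftValueReference<K, V>) ref;
    // Two-argument remove: skip the entry if the key was re-put in the meantime.
    cache.remove(reference.getKey(), reference);
  }
}

static final class SoftValueReference<K, V> extends SoftReference<V> {
  private final K key;

  public SoftValueReference(K key, V value, ReferenceQueue<V> queue) {
    super(value, queue);
    this.key = key;
  }

  public K getKey() {
    return key;
  }
}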

> Memory leak due to soft reference caching
> -----------------------------------------
>                 Key: PDFBOX-4396
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4396
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.12
>         Environment: JDK10; G1
>            Reporter: Ben Manes
>            Priority: Major
>         Attachments: memory leak 2.png, memory leak.png
> In a heap dump, it appears that DefaultResourceCache is retaining 5.3 GB of memory due
to buffered images (via PDImageXObject). I suspect that G1 is not collecting soft references
across all regions before it throws an out-of-memory error.
> In PDFBOX-4389, I discovered very slow PDDocument#load times due to a JDK10 I/O bug.
Previously I was loading the document to render each page, but this took 1.5 minutes. To work
around that bug I reused the document instance across pages. This seems to have failed because
the pages were cached and not cleared by the GC.
> The DefaultResourceCache does not prune its cache entries when the soft references are
collected. Like WeakHashMap, it should use a ReferenceQueue, poll it on every access, and
prune accordingly.
> Thankfully PDDocument#setResourceCache exists. For now I am going to reset the cache
to a new instance after a page has been rendered. The entries should no longer be reachable
and be GC'd more aggressively. If that doesn't work, I'll either replace the cache (e.g. with
Caffeine) or disable it by setting the instance to null.
> I think the desired fix is to prune the DefaultResourceCache and, ideally, reconsider
usage of soft references (as they tend to be poor in practice). 

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org
