pdfbox-dev mailing list archives

From "Ben Manes (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PDFBOX-4396) Memory leak due to soft reference caching
Date Thu, 06 Dec 2018 19:58:00 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16711955#comment-16711955 ]

Ben Manes commented on PDFBOX-4396:
-----------------------------------

The process completed for one of the large uploads, and I had to disable the others because
they were taking too long (hours). The CPU overhead on the machine caused bad user-facing
latencies, since the scheduler doesn't take CPU into account and those jobs were being delayed.
Since our use cases have expanded from the expected 5-10 page documents to many thousands of
pages (monthly historicals), I think it's no longer a good fit to do the work in a single process
shared with other user-facing work. I think my next step should be to migrate this use case to a
lambda, distribute page ranges, and invoke in parallel. That could easily be distributed
using PDFBox and work great, but it's probably easier / faster / cheaper to use Ghostscript
for such a simple lambda task.
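
To make the page-range idea concrete, here is a minimal sketch of how a document's pages could be split into chunks, one per parallel invocation. The class name PageRanges, the split method, and the chunk size are hypothetical; nothing here is part of PDFBox or of any lambda API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a page count into consecutive ranges.
final class PageRanges {
    private PageRanges() {}

    /**
     * Splits pages [0, pageCount) into chunks of at most chunkSize pages.
     * Each element is {firstPage, lastPage}, zero-based and inclusive.
     */
    static List<int[]> split(int pageCount, int chunkSize) {
        if (pageCount < 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("pageCount must be >= 0 and chunkSize > 0");
        }
        List<int[]> ranges = new ArrayList<>();
        for (int first = 0; first < pageCount; first += chunkSize) {
            // The last chunk may be shorter than chunkSize.
            int last = Math.min(first + chunkSize, pageCount) - 1;
            ranges.add(new int[] {first, last});
        }
        return ranges;
    }
}
```

For example, PageRanges.split(10, 4) yields the ranges 0-3, 4-7, and 8-9, and each range could then be rendered by a separate worker.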

The documents are not encrypted, so I think that case may not apply. In my code I often pass
around a Guava Closer to accumulate resources across methods, and then ensure all are closed
if not already closed elsewhere. If everything is associated with a document, it would make sense
for a closer to be propagated from it so that it can close all of the resources (if not closed
already). That could of course be a custom utility rather than Guava's.
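
As a rough illustration of the custom-utility option, here is a minimal sketch of a closer built only on the JDK. The name ResourceCloser and its register method are hypothetical; this is not Guava's Closer and not anything that exists in PDFBox.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical utility: accumulates resources and closes them all, newest first.
final class ResourceCloser implements AutoCloseable {
    private final Deque<AutoCloseable> resources = new ArrayDeque<>();

    /** Registers a resource to be closed later and returns it for convenience. */
    <T extends AutoCloseable> T register(T resource) {
        resources.push(resource);
        return resource;
    }

    /** Closes every registered resource, even if some of them fail to close. */
    @Override
    public void close() {
        RuntimeException failure = null;
        while (!resources.isEmpty()) {
            try {
                resources.pop().close();
            } catch (Exception e) {
                if (failure == null) {
                    failure = new IllegalStateException("Failed to close a resource", e);
                } else {
                    failure.addSuppressed(e);
                }
            }
        }
        if (failure != null) {
            throw failure;
        }
    }
}
```

A document-like owner could hold one such closer, hand it to the methods that open streams or images, and call close() once when the document itself is closed. Calling close() a second time is a no-op because the queue is already empty.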

You might also consider using weak / phantom references instead of finalization. For my
application's file I/O (local and s3), I give clients a session with their own tempdir and
reference count downloaded files against a global cache. The session handles are proxies
that clients should close, but held in a weak keyed cache where the actual implementation
is the value. Then when the proxy is collected, the strong-ref value is explicitly closed.
This acts as a safety net just in case, since we do a lot of I/O and this form of reference caching
is cheap. The same can be done better with phantom references, but that is more work than spinning
up a weak cache with a removal listener. From reading the code, it looks like a lot of effort
was made to close resources but it also got really complex with patches for the inevitable
leaks. Of course, you might not be able to change much due to API compatibility needs.
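
For illustration, a safety net of this kind can be sketched with java.lang.ref.Cleaner (Java 9+), which is based on phantom references. This is a swapped-in technique, not the weak keyed cache with a removal listener described above, and the names TrackedHandle and State are hypothetical.

```java
import java.lang.ref.Cleaner;

// Hypothetical handle: clients should close it, but the Cleaner closes the
// underlying resource if the handle is garbage collected without being closed.
final class TrackedHandle implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    /** Cleanup action; it must not reference the TrackedHandle itself. */
    private static final class State implements Runnable {
        private final AutoCloseable resource;

        State(AutoCloseable resource) {
            this.resource = resource;
        }

        @Override
        public void run() {
            try {
                resource.close();
            } catch (Exception e) {
                // Safety net only: there is no caller to report the failure to.
            }
        }
    }

    private final Cleaner.Cleanable cleanable;

    TrackedHandle(AutoCloseable resource) {
        this.cleanable = CLEANER.register(this, new State(resource));
    }

    /** Explicit close; the cleanup action runs at most once overall. */
    @Override
    public void close() {
        cleanable.clean();
    }
}
```

The explicit close() stays the normal path. The cleaner only fires for handles that were leaked, and when it fires depends on the garbage collector, so it should not be relied on for timely release.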

I think at this point I'll close this, like the other, as not something trivially fixable.
I do think better resource handling is warranted, but that requires a thoughtful refactor.

> Memory leak due to soft reference caching
> -----------------------------------------
>
>                 Key: PDFBOX-4396
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4396
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.12
>         Environment: JDK10; G1
>            Reporter: Ben Manes
>            Priority: Major
>         Attachments: #2 - memory leak 2.png, #2 - memory leak.png, memory leak 2.png, memory leak.png
>
>
> In a heap dump, it appears that DefaultResourceCache is retaining 5.3 GB of memory due
> to buffered images (via PDImageXObject). I suspect that G1 is not collecting soft references
> across all regions before it throws an OutOfMemoryError.
> In PDFBOX-4389, I discovered very slow PDDocument#load times due to a JDK10 I/O bug.
> Previously I was loading the document to render each page, but this took 1.5 minutes. To work
> around that bug I reused the document instance across pages. This seems to have failed because
> the pages were cached and not cleared by the GC.
> The DefaultResourceCache does not prune its cache entries when the soft references are
> collected. Like WeakHashMap, it should use a ReferenceQueue, poll it on every access, and
> prune accordingly.
> Thankfully PDDocument#setResourceCache exists. For now I am going to reset the cache
> to a new instance after a page has been rendered. The entries should then no longer be
> reachable and should be GC'd more aggressively. If that doesn't work, I'll either replace the
> cache (e.g. with Caffeine) or disable it by setting the instance to null.
> I think the desired fix is to prune the DefaultResourceCache and, ideally, reconsider
> usage of soft references (as they tend to be poor in practice).
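
The pruning described in the issue (a ReferenceQueue polled on every access, as WeakHashMap does) could look roughly like the sketch below. The class PruningSoftCache is hypothetical and is not the actual DefaultResourceCache code; it is not thread-safe and only shows the mechanism.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Hypothetical cache: values are softly referenced, and map entries whose
// values were collected are removed on the next access.
final class PruningSoftCache<K, V> {
    /** Soft reference that remembers its key so the map entry can be removed. */
    private static final class Entry<K, V> extends SoftReference<V> {
        final K key;

        Entry(K key, V value, ReferenceQueue<V> queue) {
            super(value, queue);
            this.key = key;
        }
    }

    private final Map<K, Entry<K, V>> map = new HashMap<>();
    private final ReferenceQueue<V> queue = new ReferenceQueue<>();

    V get(K key) {
        prune();
        Entry<K, V> entry = map.get(key);
        return entry == null ? null : entry.get();
    }

    void put(K key, V value) {
        prune();
        map.put(key, new Entry<>(key, value, queue));
    }

    int size() {
        prune();
        return map.size();
    }

    /** Drains the queue and drops entries whose referents were collected. */
    @SuppressWarnings("unchecked")
    private void prune() {
        Reference<? extends V> ref;
        while ((ref = queue.poll()) != null) {
            Entry<K, V> stale = (Entry<K, V>) ref;
            // Remove only if the key still maps to this exact stale entry.
            map.remove(stale.key, stale);
        }
    }
}
```

Pruning keeps the map itself from growing without bound, but the values remain softly reachable until the collector decides to clear them, so this does not by itself address the concern about soft references raised above.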



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org

