pdfbox-users mailing list archives

From Maruan Sahyoun <sahy...@fileaffairs.de>
Subject Re: Memory use for large PDFs?
Date Tue, 29 Sep 2015 08:16:55 GMT
Hi,

there are two ways to optimize that.

a) use a scratch file: PDDocument.load(File file, boolean useScratchFiles)
b) don't use doc.getDocumentCatalog().getAllPages(), as this fetches all pages from the
document; use PDDocumentCatalog.getPages() instead, which only gives you the root of the page
tree (the drawback is that you need to do the iteration yourself). This has been enhanced in
PDFBox 2.0.0, which also has improved resource handling.
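
To illustrate what "doing the iteration yourself" in b) amounts to, here is a minimal sketch of a depth-first walk that hands out one page at a time instead of collecting them all into a list first. It deliberately does not use PDFBox: Node, page(), inner() and labels() are hypothetical stand-ins for a page tree and its kids, so take it as an outline of the traversal pattern rather than the library's API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Depth-first iterator that yields leaf pages one at a time, in document order. */
final class LazyPageWalk implements Iterator<LazyPageWalk.Node> {

    /** Hypothetical stand-in for a page-tree node; NOT a PDFBox class. */
    static final class Node {
        final String label;     // page label for leaves, null for intermediate nodes
        final List<Node> kids;  // empty for leaves

        private Node(String label, List<Node> kids) {
            this.label = label;
            this.kids = kids;
        }

        static Node page(String label) {
            return new Node(label, Collections.<Node>emptyList());
        }

        static Node inner(Node... kids) {
            return new Node(null, Arrays.asList(kids));
        }

        boolean isPage() {
            return label != null;
        }
    }

    // Nodes still to be visited; only the current path's siblings live here.
    private final Deque<Node> stack = new ArrayDeque<>();

    LazyPageWalk(Node root) {
        stack.push(root);
    }

    @Override
    public boolean hasNext() {
        // Expand intermediate nodes until a leaf page is on top of the stack.
        while (!stack.isEmpty() && !stack.peek().isPage()) {
            Node inner = stack.pop();
            // Push kids in reverse so that the first kid is visited first.
            for (int i = inner.kids.size() - 1; i >= 0; i--) {
                stack.push(inner.kids.get(i));
            }
        }
        return !stack.isEmpty();
    }

    @Override
    public Node next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return stack.pop();
    }

    /** Demo helper: walks the tree and returns the page labels in visiting order. */
    static List<String> labels(Node root) {
        List<String> out = new ArrayList<>();
        Iterator<Node> it = new LazyPageWalk(root);
        while (it.hasNext()) {
            out.add(it.next().label);
        }
        return out;
    }

    public static void main(String[] args) {
        Node root = Node.inner(
                Node.inner(Node.page("1"), Node.page("2")),
                Node.page("3"));
        System.out.println(labels(root));
    }
}
```

In real code the body of the loop would process one page and then let go of it, so that only the nodes on the stack stay referenced by the walk.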

BR
Maruan

> On 28.09.2015 at 21:11, Roberto Nibali <rnibali@gmail.com> wrote:
> 
> Hi
> 
> This is an interesting observation. I'd be quite interested in following
> this up, since I've also seen extraordinarily high GC thrashing when running
> my tool multi-threaded against an abstracted high-level API of PDFBox. Judging
> from the brief amount of time I have spent reading the PDFBox source code,
> I believe it was written with stability in mind rather than speed. Having
> said that, though, I'm not exactly qualified to make such statements, since
> I'm merely a user of PDFBox.
> 
> Now, would you mind running your profiling using JFR and JMC? You need at
> least JDK 7u40, and you can enable basic flight recording by passing at
> least the following arguments to the JVM:
> 
> -XX:+UnlockCommercialFeatures -XX:+FlightRecorder
> 
> 
> The overhead of this kind of instrumentation is rather low (1%-2% of
> additional runtime CPU and I/O), even for high sampling rates and deep stack
> traces. Reading your post, I assume you're technically fit enough, so there
> is no further need to explain this kind of instrumentation.
> 
> In the past, I have found that JFR instrumentation gives me much better
> insight into such performance issues under memory pressure, and the
> stack trace sampling is done beautifully. It's not quite as user-friendly
> and versatile as YourKit, but it does its job. Flight recording does not,
> however, account for CPU load, so don't look at latencies.
> 
> You are certainly welcome to upload a sample PDF to a place and share your
> piece of code, so others can try to reproduce this. I won't be able to look
> at this for at least another week, notwithstanding that I'm very interested
> in seeing some memory and speed improvements for PDFBox.
> 
> Last but not least, how did you run your code in parallel? Using PDFBox
> calls from threads can result in nasty surprises for some methods. Make
> sure that each thread has access to its own PDDocument object at least,
> which judging from your problem description would not make sense.
> 
> Now, I'm not entirely sure how you tackle your problem, but it could be
> worthwhile and interesting to break it down into the following algorithm,
> which would allow you some sort of parallelism:
> 
> 1. Split your input PDF into n PDF documents of 1 to m pages each, with the
> worker task of each thread handling the set of pages you'd like to split
> out. Memory pressure should be low on this one.
> 2. Run worker threads on all PDFs produced by the above step, adding the
> footer and saving again. This should contain memory pressure if PDFBox has
> some non-linearity in memory usage as a function of the number of pages.
> 3. Merge the footer-enhanced PDFs into one final PDF.
> 
> You could even consider holding all PDDocument entries in memory after
> splitting.
> 
> Maybe this helps you pin down the issue further.
> 
> Best regards
> Roberto
> 
> On Sat, Sep 26, 2015 at 10:47 PM, Adam Retter <adam.retter@googlemail.com>
> wrote:
> 
>> Hi there,
>> 
>> I am trying to add a Footer to each page of a PDF document. My test
>> document is 100MB and consists of ~2000 pages.
>> 
>> My approach so far is similar to -
>> 
>> 
>> try (final PDDocument doc = PDDocument.load(pdf.asFile)) {
>>   // getAllPages() fetches every page of the document up front
>>   final List<PDPage> pages = doc.getDocumentCatalog().getAllPages();
>> 
>>   for (final PDPage page : pages) {
>>     // one content stream per page; closed again before moving on
>>     try (final PDPageContentStream stream =
>>         new PDPageContentStream(doc, page, true, true, true)) {
>>       addMyFooter(doc, page, stream);
>>     }
>>   }
>> 
>>   doc.save(resultFile);
>> }
>> 
>> 
>> Processing the above with the JVM set to use "ParallelGC" and a 2GB
>> heap takes 24 seconds here. Trying to run two of those operations in
>> parallel on the same JVM results in the threads running for more than
>> 16 minutes, after which I got bored and killed it. During that time the
>> CPU was being absolutely hosed by the JVM.
>> 
>> With a JVM set to use "ConcMarkSweepGC" and a 2GB heap, processing
>> takes 17 seconds. When trying to run two of these operations in
>> parallel on the same JVM after about 3.5 minutes I get a
>> java.lang.OutOfMemoryError: Java heap space.
>> 
>> Finally, with a JVM set to use "G1GC" and a 2GB heap, processing again
>> takes 17 seconds. Running two of these operations in parallel on the
>> same JVM causes both to complete in about 23 seconds each. Pushing
>> this harder, running three of these operations in parallel on the same
>> JVM results in a java.lang.OutOfMemoryError: Java heap space. Just
>> before the OOM, GC time accounts for all of the CPU time taken by the
>> Java process.
>> 
>> 
>> So what I believe is that this process seems to be generating huge
>> amounts of GC churn, and also uses a large amount of memory, up to 2GB
>> for a single 100 MB PDF document.
>> 
>> I don't really understand how trying to process a 100MB PDF can eat
>> 2GB of memory, I guess many many Java objects are the culprit (at
>> least with regards to the GC churn).
>> 
>> Is PDFBox suitable for processing larger PDF documents, and if so,
>> what stupid thing am I doing that is eating all the RAM and destroying
>> performance?
>> 
>> Thanks Adam.
>> 
>> 
>> --
>> Adam Retter
>> 
>> skype: adam.retter
>> tweet: adamretter
>> http://www.adamretter.org.uk
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>> For additional commands, e-mail: users-help@pdfbox.apache.org
>> 
>> 
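
The three-step split / process / merge idea quoted above could be outlined roughly as follows. This is only a sketch under assumptions: ranges(), processChunk() and run() are hypothetical names, and processChunk() is a placeholder for the spot where each worker would load its chunk into its own PDDocument, add the footer and save. No PDFBox calls are made here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class SplitProcessMerge {

    /** Step 1: cut pages 1..totalPages into consecutive {first, last} ranges of at most chunkSize pages. */
    static List<int[]> ranges(int totalPages, int chunkSize) {
        if (totalPages < 0 || chunkSize < 1) {
            throw new IllegalArgumentException("totalPages >= 0 and chunkSize >= 1 required");
        }
        List<int[]> out = new ArrayList<>();
        for (int first = 1; first <= totalPages; first += chunkSize) {
            out.add(new int[] { first, Math.min(first + chunkSize - 1, totalPages) });
        }
        return out;
    }

    /** Step 2 placeholder: real code would open the chunk as its own document, add the footer, save. */
    static String processChunk(int first, int last) {
        return "pages " + first + "-" + last;
    }

    /** Runs steps 1 and 2 on a fixed pool; results come back in page order, ready for step 3 (merge). */
    static List<String> run(int totalPages, int chunkSize, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int[] r : ranges(totalPages, chunkSize)) {
                // Each task only sees its own page range, never a shared document.
                Callable<String> task = () -> processChunk(r[0], r[1]);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // waits for the chunk; list order equals page order
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("chunk processing failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Roughly the document from the original post: ~2000 pages, here in chunks of 250.
        for (String chunk : run(2000, 250, 4)) {
            System.out.println(chunk);
        }
    }
}
```

The chunk size and thread count are the two knobs to experiment with: smaller chunks should keep the per-task heap footprint down, at the price of more split and merge work.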

