pdfbox-users mailing list archives

From Adam Retter <adam.ret...@googlemail.com>
Subject Memory use for large PDFs?
Date Sat, 26 Sep 2015 20:47:37 GMT
Hi there,

I am trying to add a footer to each page of a PDF document. My test
document is 100MB and consists of ~2000 pages.

My approach so far is similar to this:


try (final PDDocument doc = PDDocument.load(pdf.asFile)) {
  final List<PDPage> pages = doc.getDocumentCatalog().getAllPages();

  for (final PDPage page : pages) {
    // the three flags are appendContent, compress, resetContext
    try (final PDPageContentStream stream =
        new PDPageContentStream(doc, page, true, true, true)) {
      addMyFooter(doc, page, stream);
    }
  }

  doc.save(resultFile);
}


Processing the above with the JVM set to use "ParallelGC" and a 2GB
heap takes 24 seconds here. Running two of those operations in
parallel on the same JVM left the threads running for more than
16 minutes, after which I got bored and killed it. During that time the
JVM was absolutely hosing the CPU.

With a JVM set to use "ConcMarkSweepGC" and a 2GB heap, processing
takes 17 seconds. When I run two of these operations in parallel on
the same JVM, after about 3.5 minutes I get a
java.lang.OutOfMemoryError: Java heap space.

Finally, with a JVM set to use "G1GC" and a 2GB heap, processing again
takes 17 seconds. Running two of these operations in parallel on the
same JVM causes both to complete in about 23 seconds each. Pushing
this harder, running three of these operations in parallel on the same
JVM results in a java.lang.OutOfMemoryError: Java heap space. Just
before the OOM, GC time accounts for all of the CPU time taken by the
Java process.
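
As an illustration only, parallel runs like the ones described above
could be driven and timed with a small harness such as the sketch
below. The class name ParallelTimer and the Runnable "job" are
hypothetical stand-ins for the footer code shown earlier; neither is
part of PDFBox, and the sketch assumes Java 8 for the lambda syntax.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical harness: "job" stands in for the load / add footer / save code.
final class ParallelTimer {

  // Runs `copies` instances of `job` at the same time, one thread each,
  // and returns the wall-clock time of each instance in milliseconds.
  static long[] timeInParallel(final int copies, final Runnable job) {
    final ExecutorService pool = Executors.newFixedThreadPool(copies);
    try {
      final List<Future<Long>> futures = new ArrayList<>();
      for (int i = 0; i < copies; i++) {
        futures.add(pool.submit(() -> {
          final long start = System.nanoTime();
          job.run();
          return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        }));
      }
      final long[] millis = new long[copies];
      for (int i = 0; i < copies; i++) {
        // get() blocks until the task ends; a failure inside the task
        // (including an OutOfMemoryError) arrives as an ExecutionException
        millis[i] = futures.get(i).get();
      }
      return millis;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException("parallel run failed", e);
    } finally {
      pool.shutdown();
    }
  }
}
```

A call such as ParallelTimer.timeInParallel(2, job) would correspond
to the "two operations in parallel" case. The collector and heap size
are chosen on the JVM command line (for example -XX:+UseG1GC -Xmx2g),
not in the code.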


So I believe this process generates huge amounts of GC churn and also
uses a large amount of memory: up to 2GB for a single 100 MB PDF
document.

I don't really understand how trying to process a 100MB PDF can eat
2GB of memory. I guess a great many Java objects are the culprit (at
least with regard to the GC churn).
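
One way to put numbers on the GC churn and heap use, sketched here
under assumptions, is the standard java.lang.management API. The class
name GcProbe and the "job" parameter are hypothetical; the job stands
in for the footer code shown earlier. The collection time reported by
the JVM is approximate, so the result is a rough indicator only.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical probe built only on the JDK's management beans.
final class GcProbe {

  // Sums the approximate accumulated collection time, in milliseconds,
  // over every collector the running JVM exposes.
  static long totalGcMillis() {
    long total = 0;
    for (final GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
      final long t = gc.getCollectionTime(); // -1 when the collector does not report it
      if (t > 0) {
        total += t;
      }
    }
    return total;
  }

  // Heap currently in use, in bytes (includes garbage not yet collected).
  static long usedHeapBytes() {
    return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
  }

  // Runs the job once and reports wall time, GC time spent meanwhile,
  // and the heap in use afterwards.
  static String measure(final Runnable job) {
    final long gcBefore = totalGcMillis();
    final long start = System.nanoTime();
    job.run();
    final long wallMs = (System.nanoTime() - start) / 1_000_000;
    final long gcMs = totalGcMillis() - gcBefore;
    return "wallMs=" + wallMs + " gcMs=" + gcMs + " usedHeapBytes=" + usedHeapBytes();
  }
}
```

If gcMs comes out close to wallMs, the run is dominated by collection
rather than by useful work, which would match the behaviour described
just before the OOM.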

Is PDFBox suitable for processing larger PDF documents, and if so,
what stupid thing am I doing that is eating all the RAM and destroying
performance?

Thanks, Adam.


-- 
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk
