pdfbox-users mailing list archives

From Adam Retter <adam.ret...@googlemail.com>
Subject Re: Memory use for large PDFs?
Date Sat, 03 Oct 2015 12:25:15 GMT
Hi Roberto,

Thanks for your suggestions, please find my responses inline below...

> Now, would you mind running your profiling using JFR and JMC. You need at
> least JDK 1.7u40, and you can enable basic flight recording using at least
> the following arguments to the JVM:
>
> -XX:+UnlockCommercialFeatures -XX:+FlightRecorder

I ran one transformation of my PDF using G1GC, with Flight Recorder
logging for 60 seconds. I have uploaded the JFR file so that you and
others can take a look as well if you like; it is available
here - http://static.adamretter.org.uk/flight_recording_18051orgcambridgeservicepdfstampingBoot95081.jfr

It looks to me like a lot of both the GC and the CPU time is spent
dealing with fonts. In the GC hot spots, the most active trace, with
about 30% of the total allocation pressure, is similar to:

TLABs  Total TLAB Size (bytes)  Pressure (%)  Stack Trace
  882              329,080,552        30.228  org.apache.fontbox.cmap.CMapParser.parseNextToken(PushbackInputStream)
  882              329,080,552        30.228  org.apache.fontbox.cmap.CMapParser.parse(String, InputStream)
  882              329,080,552        30.228  org.apache.pdfbox.pdmodel.font.PDFont.parseCmap(String, InputStream)
  882              329,080,552        30.228  org.apache.pdfbox.pdmodel.font.PDSimpleFont.extractToUnicodeEncoding()
  882              329,080,552        30.228  org.apache.pdfbox.pdmodel.font.PDSimpleFont.determineEncoding()
  882              329,080,552        30.228  org.apache.pdfbox.pdmodel.font.PDFont.<init>(COSDictionary)
  882              329,080,552        30.228  org.apache.pdfbox.pdmodel.font.PDSimpleFont.<init>(COSDictionary)
  847              313,716,808        28.817  org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(COSDictionary)
  847              313,716,808        28.817  org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(COSDictionary)
  847              313,716,808        28.817  org.apache.pdfbox.pdmodel.PDResources.getFonts()
  847              313,716,808        28.817  org.apache.pdfbox.pdmodel.PDResources.addFont(PDFont)
  847              313,716,808        28.817  org.apache.pdfbox.pdmodel.edit.PDPageContentStream.setFont(PDFont, float)


I only call PDTrueTypeFont#loadTTF once per document, and I only use
two TrueType fonts.
When I write the footer onto each page of my PDF, I call the following
code block once for each line I want to write onto the page (typically
3 or 4 lines):

contentStream.beginText()
contentStream.setFont(font, fontSize)
contentStream.setNonStrokingColor(color)
contentStream.moveTextPositionByAmount(position.x, newY)
contentStream.drawString(line)
contentStream.endText()

I am not sure why PDFBox spends so much time dealing with Type 0 fonts
when I am only using PDF Fonts, or why working with fonts is so
intensive.
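
One thing the trace does suggest is that every setFont call walks
through PDResources.addFont and getFonts, so calling it once per footer
line repeats that work 3 or 4 times per page. Below is a minimal
sketch of an alternative, assuming (not verified against the PDFBox
source) that fewer setFont calls means fewer of those walks: open one
text block per footer, set the font and colour once, and move between
lines with relative offsets. moveTextPositionByAmount writes the PDF Td
operator, which is relative to the start of the current line, so only
the first move carries the absolute position. The FooterOffsets class
and its method are hypothetical names; the PDFBox calls appear only in
comments so that the block runs without the library.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: computes the moves for a multi-line footer in one text block. */
final class FooterOffsets {

    /**
     * Returns one {dx, dy} pair per footer line. The first pair is the
     * absolute position of the first baseline (the text matrix starts at
     * the origin after beginText); each later pair steps down by `leading`.
     */
    static List<float[]> moves(float x, float firstBaselineY, float leading, int lineCount) {
        List<float[]> result = new ArrayList<>();
        for (int i = 0; i < lineCount; i++) {
            if (i == 0) {
                result.add(new float[] {x, firstBaselineY});
            } else {
                result.add(new float[] {0f, -leading});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] lines = {"line one", "line two", "line three"};
        List<float[]> offsets = moves(50f, 40f, 12f, lines.length);

        // Intended use with the calls from the snippet above:
        //   contentStream.beginText();
        //   contentStream.setFont(font, fontSize);       // once per footer
        //   contentStream.setNonStrokingColor(color);    // once per footer
        //   for each line i:
        //       contentStream.moveTextPositionByAmount(dx, dy);
        //       contentStream.drawString(lines[i]);
        //   contentStream.endText();
        for (int i = 0; i < lines.length; i++) {
            float[] move = offsets.get(i);
            System.out.println("move by (" + move[0] + ", " + move[1] + ") then draw: " + lines[i]);
        }
    }
}
```

Whether this actually reduces the CMap parsing seen in the trace would
need to be confirmed with another flight recording.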


> You are certainly welcome to upload a sample PDF to a place and share your
> piece of code, so others can try to reproduce this. I won't be able to look
> at this for at least another week, notwithstanding that I'm very interested
> in seeing some memory and speed improvements for PDFBox.

I could post some code and a PDF. However, the code for this project
is unfortunately proprietary, and it is also written in Scala, so I
would need to extract an example from it into a Java equivalent. That
would take some time, so I would rather do it only as a last resort,
if I can't get some suggestions for reducing the memory footprint.

> Last but not least, how did you run your code in parallel? Using PDFBox
> calls from threads can result in nasty surprises for some methods. Make
> sure that each thread has access to its own PDDocument object at least,
> which judging from your problem description would not make sense.

This is basically a small web service. Each request to the web service
runs in a new thread, which has its own PDF document and its own
instance of PDFBox for processing that PDF; nothing is shared between
the threads.

> Now, I'm not entirely sure how you tackle your problem, however it could be
> worthwhile and interesting to break it down to following algorithm, which
> would allow you some sort of parallelism:
>
> 1. Split your input PDF into n (1-m) pages PDF documents using the worker
> task of each thread on the set of pages you'd like to split out. Memory
> pressure should be low on this one.

Is there some example code showing how this might be done?

> 2. Run worker threads on all PDFs found from the above step and add footer
> and save again. This should contain memory pressure if PDFBox had some
> non-linearity regarding memory usage as a function of amount of pages.

Interesting.

> 3. Merge footer-enhanced PDFs into one final PDF.

Would this not require reading all the PDFs and joining them, i.e.
wouldn't it be even more memory intensive than having a single PDF? If
not, again is there a code example for doing something like this?
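
This is not PDFBox code, but the overall shape of the three steps
quoted above (split, process chunks in worker threads, merge in order)
can be sketched in plain Java. It is a minimal sketch under
assumptions: ChunkedPipeline, split, stamp and run are hypothetical
names, and stamp is a placeholder that returns strings where real code
would open a chunk document, add the footers and save it. In a real
version, steps 1 and 3 would presumably use PDFBox's Splitter and
PDFMergerUtility classes. Whether that lowers peak memory is exactly
the open question, and the sketch does not answer it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical split / process / merge skeleton; no PDFBox calls. */
final class ChunkedPipeline {

    /** Step 1: split pages 0..pageCount-1 into {start, endExclusive} ranges. */
    static List<int[]> split(int pageCount, int chunkSize) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be at least 1");
        }
        List<int[]> chunks = new ArrayList<>();
        for (int start = 0; start < pageCount; start += chunkSize) {
            chunks.add(new int[] {start, Math.min(start + chunkSize, pageCount)});
        }
        return chunks;
    }

    /** Step 2 placeholder: stands in for adding a footer to each page of one chunk. */
    static List<String> stamp(int[] chunk) {
        List<String> pages = new ArrayList<>();
        for (int page = chunk[0]; page < chunk[1]; page++) {
            pages.add("page " + (page + 1) + " + footer");
        }
        return pages;
    }

    /** Runs step 2 on a thread pool, then step 3: merge results in chunk order. */
    static List<String> run(int pageCount, int chunkSize, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> pending = new ArrayList<>();
            for (final int[] chunk : split(pageCount, chunkSize)) {
                pending.add(pool.submit(new Callable<List<String>>() {
                    @Override
                    public List<String> call() {
                        return stamp(chunk); // each task touches only its own chunk
                    }
                }));
            }
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> result : pending) {
                merged.addAll(result.get()); // submission order keeps pages in order
            }
            return merged;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("chunk processing failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        for (String page : run(10, 3, 4)) {
            System.out.println(page);
        }
    }
}
```

Note that the merge loop holds every processed chunk at once, which is
the memory concern raised above; a real version could write each chunk
to disk and merge from files instead.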

> You could even consider holding all PDDocument entries in memory after
> splitting.

Are PDDocument instances light-weight compared to PDPage then?
However, to add a footer to each page, surely I would have to have
just as many PDPage instances, even if they are from different
documents. In fact I would probably have more: at the moment I process
a document sequentially, calling close on each PDPage when I am
finished with it. If I processed multiple pages in parallel, then
surely I would expect an increase in memory use?

> Maybe this helps you pin-pointing the issue down further.

Thanks for the suggestions, they have given me some food for thought.
Hopefully the JFR trace file will enable you or one of the PDFBox
developers to suggest some improvements.

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

