pdfbox-users mailing list archives

From Adam Retter <adam.ret...@googlemail.com>
Subject Re: Memory use for large PDFs?
Date Sat, 03 Oct 2015 10:16:52 GMT
Sorry for the delay; I was away for a week.

> a) use a scratch file PDDocument.load(File file, boolean useScratchFiles)

I could not find a load method that has a boolean parameter to
indicate whether to use scratch files. However, if I use
PDDocument#load(File file, RandomAccess scratchFile) and specify a
scratch file, then I get an exception for every page I
process. The exception itself doesn't seem to cause any problems, as the
resulting PDF looks correct, but it is disconcerting. The stack trace
for the exception looks like this:

[error] Oct 03, 2015 11:10:50 AM org.apache.pdfbox.pdmodel.font.PDFont parseCmap
[error] SEVERE: An error occurs while reading a CMap
[error] java.io.IOException: Error: expected the end of a dictionary.
[error] at org.apache.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:432)
[error] at org.apache.fontbox.cmap.CMapParser.parse(CMapParser.java:119)
[error] at org.apache.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:626)
[error] at org.apache.pdfbox.pdmodel.font.PDSimpleFont.extractToUnicodeEncoding(PDSimpleFont.java:457)
[error] at org.apache.pdfbox.pdmodel.font.PDSimpleFont.determineEncoding(PDSimpleFont.java:411)
[error] at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:214)
[error] at org.apache.pdfbox.pdmodel.font.PDSimpleFont.<init>(PDSimpleFont.java:89)
[error] at org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:67)
[error] at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:108)
[error] at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:213)
[error] at org.apache.pdfbox.pdmodel.PDResources.addFont(PDResources.java:586)
[error] at org.apache.pdfbox.pdmodel.edit.PDPageContentStream.setFont(PDPageContentStream.java:321)

> b) don't use doc.getDocumentCatalog.getAllPages() as this fetches all pages from the
> document but use PDDocumentCatalog.getPages() which only gives you the root into the page
> tree (drawback is that you need to do the iteration yourself). That has been enhanced in PDFBox
> 2.0.0 which also has an improved resource handling.

How do I do the iteration myself? Are there any examples?

If I use PDDocumentCatalog#getPages() then I get a PDPageNode, but
from there it looks like I have to call PDPageNode#getKids(), which
then just gives me a list of all pages. I cannot see how this
would be any more efficient. Can someone explain?
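
For illustration only, here is a minimal sketch of what "doing the
iteration yourself" could look like, assuming the page tree is nested
(a node's kids may be pages or further nodes). Node, Page, Branch and
pages() are hypothetical stand-ins invented for this sketch, not PDFBox
classes. The point is that a depth-first walk expands one branch at a
time instead of first collecting every page into a single list.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class PageTreeWalk {

    /** Hypothetical stand-in for a page-tree entry (not a PDFBox type). */
    interface Node {}

    /** Hypothetical leaf: a single page. */
    static final class Page implements Node {
        final int number;
        Page(int number) { this.number = number; }
    }

    /** Hypothetical intermediate node; its kids are pages or further branches. */
    static final class Branch implements Node {
        final List<Node> kids;
        Branch(Node... kids) { this.kids = Arrays.asList(kids); }
    }

    /** Depth-first iterator that only expands a branch when the walk reaches it. */
    static Iterator<Page> pages(final Node root) {
        // One iterator per tree level currently being visited.
        final Deque<Iterator<Node>> stack = new ArrayDeque<>();
        stack.push(Collections.<Node>singletonList(root).iterator());

        return new Iterator<Page>() {
            private Page next = advance();

            // Finds the next leaf, descending into branches as they are met.
            private Page advance() {
                while (!stack.isEmpty()) {
                    Iterator<Node> top = stack.peek();
                    if (!top.hasNext()) {
                        stack.pop();          // this level is exhausted
                        continue;
                    }
                    Node n = top.next();
                    if (n instanceof Page) {
                        return (Page) n;
                    }
                    stack.push(((Branch) n).kids.iterator());
                }
                return null;                  // no pages left
            }

            @Override
            public boolean hasNext() { return next != null; }

            @Override
            public Page next() {
                if (next == null) throw new NoSuchElementException();
                Page current = next;
                next = advance();
                return current;
            }
        };
    }

    public static void main(String[] args) {
        // A small nested tree: one inner branch with two pages, then a third page.
        Node root = new Branch(new Branch(new Page(1), new Page(2)), new Page(3));
        for (Iterator<Page> it = pages(root); it.hasNext(); ) {
            System.out.println("page " + it.next().number);
        }
    }
}
```

Whether this actually saves memory with the real API depends on how the
library materialises the kids of each node, which this sketch does not
model.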

Also I see that PDFBox 2.0.0 is not yet released but does have an
iterator interface on PDPageTree. Is it already stable/reliable enough
to use?

Adam Retter

skype: adam.retter
tweet: adamretter
