pdfbox-users mailing list archives

From Andrea Vacondio <andrea.vacon...@gmail.com>
Subject Xref parsing performance
Date Fri, 27 Feb 2015 15:34:28 GMT
A few days ago I was profiling PDFBox while loading medium/large
documents and I think I found something.
If you try loading the document
http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf you'll see
it takes quite some time, and most of that time is spent in
XrefTrailerResolver.getContainedObjectNumbers. The issue is that every time
an object contained in an unparsed object stream is found,
XrefTrailerResolver performs a full scan of the xref entries found in the
document, in this case hundreds of thousands. If there are many object
streams (as in the given doc), it performs many full scans, resulting in
poor performance.
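
To illustrate the cost pattern described above, here is a minimal sketch
contrasting a per-lookup full scan with a reverse index built once. It is
only one possible way to avoid the repeated scans. The class
XrefIndexSketch, its methods and its map layout are hypothetical stand-ins,
not the actual PDFBox classes and not the code in the linked branch.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical, simplified model of xref entries; not the real PDFBox classes. */
public class XrefIndexSketch {

    /** Object number -> number of the object stream that contains it. */
    private final Map<Long, Long> containingStream = new HashMap<>();

    /** Reverse index, built lazily: object stream number -> contained object numbers. */
    private Map<Long, Set<Long>> byStream;

    public void addCompressedEntry(long objectNumber, long streamNumber) {
        containingStream.put(objectNumber, streamNumber);
        byStream = null; // entries changed, so drop any index built earlier
    }

    /** Slow variant: scans every entry on each call, so k lookups cost k full scans. */
    public Set<Long> containedObjectNumbersByScan(long streamNumber) {
        Set<Long> result = new TreeSet<>();
        for (Map.Entry<Long, Long> e : containingStream.entrySet()) {
            if (e.getValue() == streamNumber) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    /** Faster variant: a single scan builds the index, later calls are map lookups. */
    public Set<Long> containedObjectNumbersByIndex(long streamNumber) {
        if (byStream == null) {
            byStream = new HashMap<>();
            for (Map.Entry<Long, Long> e : containingStream.entrySet()) {
                byStream.computeIfAbsent(e.getValue(), k -> new TreeSet<>()).add(e.getKey());
            }
        }
        return byStream.getOrDefault(streamNumber, Collections.emptySet());
    }

    public static void main(String[] args) {
        XrefIndexSketch xref = new XrefIndexSketch();
        xref.addCompressedEntry(1, 10);
        xref.addCompressedEntry(2, 10);
        xref.addCompressedEntry(3, 20);
        // Both variants return the same object numbers for stream 10.
        System.out.println(xref.containedObjectNumbersByScan(10));
        System.out.println(xref.containedObjectNumbersByIndex(10));
    }
}
```

With n xref entries and k object streams, the scanning variant does work
proportional to n * k, while the indexed variant does one pass over the n
entries plus k map lookups, at the price of keeping the extra index in
memory.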
I'm trying to get familiar with the PDFBox code and I decided to try to
fix this here: https://github.com/torakiki/sambox/tree/xref
As you can see, I refactored a bit, extracting some classes, and covered the
expected behaviour with unit tests. I tested it with a few random docs,
loading and saving them back, and the output is exactly the same with or
without my changes. The pdf_reference_1-7.pdf doc loads in half the time.
Other kinds of docs load in a comparable amount of time, and when profiling
memory usage it seems comparable too, if not a little better.
Maybe someone wants to take a look?

I understand my changes look a bit invasive and the issue could probably be
fixed differently. On the other hand, the BaseParser+COSParser pair looks
like a big intimidating monster to a newcomer like me and it's quite
difficult to follow the expected behaviour, so I thought this might be a
chance to start breaking them down into smaller, distilled classes...
something a little more manageable and testable... anyway, grab what you
like, leave what you don't :)
