pdfbox-users mailing list archives

From Maruan Sahyoun <sahy...@fileaffairs.de>
Subject Re: Xref parsing performance
Date Fri, 27 Feb 2015 15:59:15 GMT
Looked at it quickly - very nice!
 
Maruan

On 27.02.2015 at 16:34, Andrea Vacondio <andrea.vacondio@gmail.com> wrote:

> Hi,
> A few days ago I was profiling PDFBox while loading medium to large
> documents, and I think I found something.
> If you try loading the document
> http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf you'll see
> it takes quite some time, most of it spent in
> XrefTrailerResolver.getContainedObjectNumbers. The issue is that every time
> an object contained in an unparsed object stream is found, the
> XrefTrailerResolver performs a full scan of the xref entries found in the
> document, in this case hundreds of thousands. If there are many object
> streams (as in the given doc), it performs many full scans, resulting in
> poor performance.
> I'm trying to get familiar with the PDFBox code and I decided to try to
> fix this here: https://github.com/torakiki/sambox/tree/xref
> As you can see, I refactored a bit, extracting some classes, and covered
> the expected behaviour with unit tests. I tested it with a few random docs,
> loading and saving them back, and the output is exactly the same with or
> without my changes. The pdf_reference_1-7.pdf doc loads in half the time,
> and so does
> http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
> Other kinds of docs load in a comparable amount of time, and memory
> profiling shows comparable usage, if not a little less.
> Maybe someone wants to take a look?
> 
> I understand my changes look a bit invasive and the issue could probably be
> fixed differently. On the other hand, the BaseParser+COSParser pair looks
> like a big intimidating monster to a newcomer like me, and it's quite
> difficult to follow the expected behaviour, so I thought this might be a
> chance to start breaking them down into smaller, distilled classes...
> something a little more manageable and testable... anyway, grab what you
> like, leave what you don't  :)
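
The full-scan problem described in the quoted message can be illustrated with a small sketch. This is not the PDFBox or sambox code: the class XrefStreamIndex, its method names, and the plain Map from object number to containing object stream number are all hypothetical, chosen only to contrast a per-lookup scan with an index built once.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not the PDFBox or sambox implementation.
class XrefStreamIndex {

    // Naive lookup: walks every xref entry on each call, so the cost is
    // proportional to the total number of entries for every object stream.
    static Set<Long> containedByFullScan(Map<Long, Long> objectToStream, long streamNumber) {
        Set<Long> result = new TreeSet<>();
        for (Map.Entry<Long, Long> entry : objectToStream.entrySet()) {
            if (entry.getValue() == streamNumber) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    // One-off pass that groups object numbers by their containing stream,
    // so later lookups are a single map access.
    static Map<Long, Set<Long>> buildIndex(Map<Long, Long> objectToStream) {
        Map<Long, Set<Long>> index = new HashMap<>();
        for (Map.Entry<Long, Long> entry : objectToStream.entrySet()) {
            index.computeIfAbsent(entry.getValue(), k -> new TreeSet<>()).add(entry.getKey());
        }
        return index;
    }

    public static void main(String[] args) {
        // Assumed sample data: objects 10 and 12 live in stream 5, object 11 in stream 6.
        Map<Long, Long> objectToStream = new HashMap<>();
        objectToStream.put(10L, 5L);
        objectToStream.put(11L, 6L);
        objectToStream.put(12L, 5L);

        Map<Long, Set<Long>> index = buildIndex(objectToStream);
        System.out.println(index.getOrDefault(5L, new TreeSet<>()));
        System.out.println(containedByFullScan(objectToStream, 5L));
    }
}
```

Both lookups return the same set of object numbers. The difference is that the scan repeats its work for every object stream, while the index pays for one pass over the entries up front.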

