pdfbox-users mailing list archives

From Stefan Magnus Landrø <stefan.lan...@gmail.com>
Subject non sequential parser and 2.0
Date Tue, 27 Jan 2015 11:50:40 GMT
Hi there,

I reported an issue related to the non sequential parser in the 1.8 code
line last year (PDFBOX-1965) and was really happy to see that the issue was
recently fixed. Thanks a lot, Andreas!

I also noticed that the non sequential parser will become the default
parser in 2.0.

In my project we're using PDFBox to verify that every page in a given PDF
can be printed by a third-party print service (all pages have to be A4, use
only standard fonts or embed any others, have certain margins, and so on).

We noticed that the document returned by getDocument() takes up more and
more memory as we iterate over all the pages in the PDF, especially if the
PDF is large and structurally complex
(http://no.mouser.com/catalog/English/103/dload/pdf/mouser.pdf demonstrates
the effect well). We free the memory gradually by doing the following in a
subclass of NonSequentialPDFParser / COSParser:

    public PDPage getPage(int pageNr) throws IOException {
        // Free up memory regularly
        if (pageNr % 5 == 0) {
            Set<COSObjectKey> cosObjectKeys =
            for (COSObjectKey cosObjectKey : cosObjectKeys) {
        return super.getPage(pageNr);
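
To illustrate the idea independently of PDFBox, here is a minimal sketch of
the same "evict every N pages" pattern. Everything in it is hypothetical: the
class EvictingPageReader, its objectPool map and the parsePage placeholder are
stand-ins I made up for illustration and are not PDFBox API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a parser that caches every object it loads. */
class EvictingPageReader {
    // Stand-in for the parser's object pool; not a PDFBox structure.
    private final Map<Integer, byte[]> objectPool = new HashMap<>();
    private final int evictEvery;

    EvictingPageReader(int evictEvery) {
        if (evictEvery <= 0) {
            throw new IllegalArgumentException("evictEvery must be positive");
        }
        this.evictEvery = evictEvery;
    }

    /** Returns the page, dropping all cached objects first on every evictEvery-th page. */
    byte[] getPage(int pageNr) {
        if (pageNr % evictEvery == 0) {
            // Free up memory regularly, as in the subclass above.
            objectPool.clear();
        }
        return objectPool.computeIfAbsent(pageNr, this::parsePage);
    }

    /** Placeholder for real parsing work; allocates a dummy buffer. */
    private byte[] parsePage(int pageNr) {
        return new byte[1024];
    }

    /** Number of objects currently held in the pool. */
    int cachedObjectCount() {
        return objectPool.size();
    }
}
```

The trade-off is the same as in our subclass: anything evicted has to be
parsed again if a later page refers to it, so the interval (5 in our case) is
a balance between memory use and repeated parsing.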

This feels a bit like a hack. Is there any chance this kind of functionality
could be built into PDFBox?

And, BTW, any idea when the 2.0 release will be ready? Are you planning to
ship release candidates too (so that people don't have to rely on or
distribute snapshot versions)?




TesTcl - a unit test framework for iRules
