pdfbox-users mailing list archives

From Stefan Magnus Landrø <stefan.lan...@gmail.com>
Subject Re: Stream parsing huge PDF document in order to prevent memory issues
Date Thu, 06 Mar 2014 14:39:24 GMT
Hi there,

So I tried using the NonSequentialPDFParser, setting the
org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal system
property to true.
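
For reference, this is roughly how I drive the parser (file names are
placeholders):

    import java.io.File;
    import org.apache.pdfbox.io.RandomAccess;
    import org.apache.pdfbox.io.RandomAccessFile;
    import org.apache.pdfbox.pdfparser.NonSequentialPDFParser;
    import org.apache.pdfbox.pdmodel.PDPage;

    // enable minimal parsing before the parser is created
    System.setProperty(
            "org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal",
            "true");

    // the scratch file keeps buffered stream data on disk instead of in memory
    RandomAccess scratch =
            new RandomAccessFile(new File("/tmp/pdfbox-scratch.bin"), "rw");
    NonSequentialPDFParser parser =
            new NonSequentialPDFParser(new File("huge.pdf"), scratch);
    parser.parse();

    PDPage page = parser.getPage(0); // <-- this is where the NPE occurs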

The memory footprint looks much better; however, I can't get the
individual pages because of an NPE in the getPage code.

It turns out that resDict below is mostly null, which in turn causes an
NPE in parseDictObjects.

Should I file a bug?

Stefan


    public PDPage getPage(int pageNr) throws IOException
    {
        getPagesObject();

        // ---- get list of top level pages
        COSArray kids = (COSArray) pagesDictionary.getDictionaryObject(COSName.KIDS);

        if (kids == null)
        {
            throw new IOException("Missing 'Kids' entry in pages dictionary.");
        }

        // ---- get page we are looking for (possibly going recursively into
        // subpages)
        COSObject pageObj = getPageObject(pageNr, kids, 0);

        if (pageObj == null)
        {
            throw new IOException("Page " + pageNr + " not found.");
        }

        // ---- parse all objects necessary to load page.
        COSDictionary pageDict = (COSDictionary) pageObj.getObject();

        if (parseMinimalCatalog && (!allPagesParsed))
        {
            // parse page resources since we did not do this on start
            COSDictionary resDict = (COSDictionary) pageDict.getDictionaryObject(COSName.RESOURCES);
            parseDictObjects(resDict);
        }

        return new PDPage(pageDict);
    }
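
Since a page may inherit its Resources from an ancestor pages node, a null
resDict looks legal to me, so I'd guess a simple guard in getPage would
avoid the crash. A sketch (untested):

        if (parseMinimalCatalog && (!allPagesParsed))
        {
            // parse page resources since we did not do this on start;
            // Resources may be inherited from an ancestor pages node,
            // so a missing entry here is legal - just skip it then
            COSDictionary resDict = (COSDictionary) pageDict.getDictionaryObject(COSName.RESOURCES);
            if (resDict != null)
            {
                parseDictObjects(resDict);
            }
        }

(A fuller fix would probably resolve the inherited Resources by walking up
the Parent chain, but the guard at least prevents the NPE.)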



2014-02-14 10:35 GMT+01:00 Maruan Sahyoun <sahyoun@fileaffairs.de>:

> Hi,
>
> PDF is a random-access format: the key information (the cross-reference
> table, which records where to find each object) sits at the end of the
> file, and the PDF objects themselves are spread around the file.
>
> You can use the NonSequentialPDFParser by calling PDDocument.loadNonSeq
> instead of PDDocument.load and setting the system property
> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal, which makes
> it do a minimal parse of the PDF. That could reduce the memory consumption a
> little bit. Unfortunately, once an object has been parsed, its content
> stays in memory, so you would need to do a low-level parse yourself with
> the information available from the initial parsing stage.
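>
> A minimal sketch of that call sequence (file names are placeholders;
> RandomAccessFile is org.apache.pdfbox.io.RandomAccessFile):
>
>     System.setProperty(
>             "org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal",
>             "true");
>     // the scratch file buffers stream data on disk instead of in memory
>     PDDocument doc = PDDocument.loadNonSeq(new File("input.pdf"),
>             new RandomAccessFile(new File("scratch.bin"), "rw"));
>     try
>     {
>         // inspect the document here
>     }
>     finally
>     {
>         doc.close();
>     }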
>
> Maruan Sahyoun
>
> On 14.02.2014 at 09:50, Stefan Magnus Landrø <stefan.landro@gmail.com>
> wrote:
>
> > Hi there,
> >
> > I'm trying to validate random PDFs (potentially huge, hundreds of MBs)
> > against the following rule set:
> > - Dimensions of all pages should be A4 (297 mm * 210 mm)
> > - There should be no content within a certain rectangular area of a page
> > (left margin where the print shop inserts a bar code)
> > - Number of pages should be less than N
> > - PDF version used
> >
> > So far we've been using PDDocument.load with a scratch file, but with
> > huge documents (e.g. product catalogues) things explode. Is there a way
> > to stream-parse a PDF, similar to stream parsing an XML document (e.g.
> > using StAX), and validate one page at a time?
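> >
> > For the A4 rule I imagine comparing each page's MediaBox, which is in
> > points (1 pt = 1/72 inch). A hypothetical check, with page being a PDPage:
> >
> >     PDRectangle box = page.findMediaBox(); // resolves inherited MediaBox
> >     float widthMm = box.getWidth() / 72f * 25.4f;
> >     float heightMm = box.getHeight() / 72f * 25.4f;
> >     boolean isA4 = Math.abs(widthMm - 210f) < 1f
> >             && Math.abs(heightMm - 297f) < 1f; // 1 mm tolerance (arbitrary)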
> >
> > Cheers
> >
> > Stefan
>
>


-- 
BEKK Open
http://open.bekk.no

TesTcl - a unit test framework for iRules
http://testcl.com
