pdfbox-users mailing list archives

From Maruan Sahyoun <sahy...@fileaffairs.de>
Subject Re: Stream parsing huge PDF document in order to prevent memory issues
Date Fri, 07 Mar 2014 06:11:06 GMT
Hi Stefan,

The GitHub project is just fine. If I need more information I'll let you know.

BR
Maruan Sahyoun

On 06.03.2014 at 23:53, Stefan Magnus Landrø <stefan.landro@gmail.com> wrote:

> Hi Maruan,
> 
> So I created a small Maven project containing a PDF file I just generated
> on my Mac, and pushed it to https://github.com/landro/pdfboxbug
> I could create a zip and upload it to your bug tracker, but that feels
> kinda awkward.
> What do you prefer?
> 
> Stefan
> 
> 
> 
> 2014-03-06 15:47 GMT+01:00 Maruan Sahyoun <sahyoun@fileaffairs.de>:
> 
>> Yes please, file a bug report together with a sample PDF and sample code
>> to reproduce the issue. Which PDFBox version are you using?
>> 
>> BR
>> Maruan Sahyoun
>> 
>> On 06.03.2014 at 15:39, Stefan Magnus Landrø <stefan.landro@gmail.com> wrote:
>> 
>>> Hi there,
>>> 
>>> So I tried using the NonSequentialParser, setting the
>>> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal property
>>> to true.
>>> 
>>> The memory footprint looks much better; however, I can't get the
>>> individual pages due to an NPE in the getPage code.
>>> 
>>> It turns out the resDict below is mostly null - which in turn causes an
>>> NPE in parseDictObjects.
>>> 
>>> Should I file a bug?
>>> 
>>> Stefan
>>> 
>>> 
>>>   public PDPage getPage(int pageNr) throws IOException
>>>   {
>>>       getPagesObject();
>>> 
>>>       // ---- get list of top level pages
>>>       COSArray kids = (COSArray)
>>>           pagesDictionary.getDictionaryObject(COSName.KIDS);
>>> 
>>>       if (kids == null)
>>>       {
>>>           throw new IOException("Missing 'Kids' entry in pages dictionary.");
>>>       }
>>> 
>>>       // ---- get page we are looking for (possibly going recursively
>>>       // into subpages)
>>>       COSObject pageObj = getPageObject(pageNr, kids, 0);
>>> 
>>>       if (pageObj == null)
>>>       {
>>>           throw new IOException("Page " + pageNr + " not found.");
>>>       }
>>> 
>>>       // ---- parse all objects necessary to load page.
>>>       COSDictionary pageDict = (COSDictionary) pageObj.getObject();
>>> 
>>>       if (parseMinimalCatalog && (!allPagesParsed))
>>>       {
>>>           // parse page resources since we did not do this on start;
>>>           // NOTE: resDict is null when the page dict has no direct
>>>           // /Resources entry (e.g. when it is inherited from the parent
>>>           // /Pages node) - this is the source of the NPE described above
>>>           COSDictionary resDict = (COSDictionary)
>>>               pageDict.getDictionaryObject(COSName.RESOURCES);
>>>           parseDictObjects(resDict);
>>>       }
>>> 
>>>       return new PDPage(pageDict);
>>>   }
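>>> 
>>> A guard like the following (just a sketch on my side - I haven't checked
>>> whether silently skipping the resource parsing has side effects
>>> elsewhere) avoids the NPE for pages without a direct /Resources entry:
>>> 
>>>       COSDictionary resDict = (COSDictionary)
>>>           pageDict.getDictionaryObject(COSName.RESOURCES);
>>>       if (resDict != null)
>>>       {
>>>           // only descend into the resources if the page carries any itself
>>>           parseDictObjects(resDict);
>>>       }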
>>> 
>>> 
>>> 
>>> 2014-02-14 10:35 GMT+01:00 Maruan Sahyoun <sahyoun@fileaffairs.de>:
>>> 
>>>> Hi,
>>>> 
>>>> PDF is a random-access format: the key information (the cross-reference
>>>> table, which says where to find the objects) is at the end of the file,
>>>> and the PDF objects themselves are spread around the file.
>>>> 
>>>> You can use the NonSequentialParser by calling PDDocument.loadNonSeq
>>>> instead of PDDocument.load and setting the system property
>>>> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal, which
>>>> makes the parser do a minimal parsing of the PDF. That could reduce the
>>>> memory consumption a little bit. Unfortunately, once an object has been
>>>> parsed its content stays in memory, so you would need to do a low-level
>>>> parsing yourself with the information available from the initial
>>>> parsing stage.
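>>>> 
>>>> For example (just a sketch against the 1.8 API - the file names are
>>>> made up, and the property has to be set before the document is opened):
>>>> 
>>>>     // enable minimal parsing before loading
>>>>     System.setProperty(
>>>>         "org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal",
>>>>         "true");
>>>> 
>>>>     // loadNonSeq takes a scratch file (org.apache.pdfbox.io.RandomAccessFile)
>>>>     File scratch = File.createTempFile("pdfbox-scratch", ".bin");
>>>>     PDDocument doc = PDDocument.loadNonSeq(new File("huge.pdf"),
>>>>             new RandomAccessFile(scratch, "rw"));
>>>>     try
>>>>     {
>>>>         // work with individual pages here
>>>>     }
>>>>     finally
>>>>     {
>>>>         doc.close();
>>>>     }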
>>>> 
>>>> Maruan Sahyoun
>>>> 
>>>> On 14.02.2014 at 09:50, Stefan Magnus Landrø <stefan.landro@gmail.com> wrote:
>>>> 
>>>>> Hi there,
>>>>> 
>>>>> I'm trying to validate random PDFs (potentially huge - hundreds of MBs)
>>>>> against the following rule set:
>>>>> - Dimensions of all pages should be A4 (297 mm x 210 mm)
>>>>> - There should be no content within a certain rectangular area of a
>>>>>   page (the left margin, where the print shop inserts a bar code)
>>>>> - Number of pages should be less than N
>>>>> - The PDF version used
>>>>> 
>>>>> So far we've been using PDDocument.load with a scratch file, but with
>>>>> huge documents (e.g. product catalogues) things explode.
>>>>> Is there a way to stream-parse a PDF, similar to stream-parsing an XML
>>>>> document (e.g. using StAX), and validate one page at a time? A sketch
>>>>> of our current approach follows below.
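>>>>> 
>>>>> For context, the A4 check we run today looks roughly like this (a
>>>>> sketch only - validateA4 is a made-up name, the real code also passes
>>>>> a scratch file, and we allow ~2 pt of rounding; A4 is 210 mm x 297 mm,
>>>>> i.e. about 595 x 842 points at 72 points per inch):
>>>>> 
>>>>>     static void validateA4(File pdf) throws IOException
>>>>>     {
>>>>>         PDDocument doc = PDDocument.load(pdf);
>>>>>         try
>>>>>         {
>>>>>             for (Object p : doc.getDocumentCatalog().getAllPages())
>>>>>             {
>>>>>                 // media box dimensions are in points
>>>>>                 PDRectangle box = ((PDPage) p).findMediaBox();
>>>>>                 if (Math.abs(box.getWidth() - 595f) > 2f
>>>>>                         || Math.abs(box.getHeight() - 842f) > 2f)
>>>>>                 {
>>>>>                     throw new IOException("page is not A4");
>>>>>                 }
>>>>>             }
>>>>>         }
>>>>>         finally
>>>>>         {
>>>>>             doc.close();
>>>>>         }
>>>>>     }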
>>>>> 
>>>>> Cheers
>>>>> 
>>>>> Stefan
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> BEKK Open
>>> http://open.bekk.no
>>> 
>>> TesTcl - a unit test framework for iRules
>>> http://testcl.com
>> 
>> 
> 
> 
> -- 
> BEKK Open
> http://open.bekk.no
> 
> TesTcl - a unit test framework for iRules
> http://testcl.com

