pdfbox-users mailing list archives

From Stefan Magnus Landrø <stefan.lan...@gmail.com>
Subject Stream parsing huge PDF document in order to prevent memory issues
Date Fri, 14 Feb 2014 08:50:24 GMT
Hi there,

I'm trying to validate arbitrary PDFs (potentially huge, hundreds of MB)
against the following rule set:
- Dimensions of all pages should be A4 (297 mm * 210 mm)
- There should be no content within a certain rectangular area of a page
(left margin where the print shop inserts a bar code)
- Number of pages should be less than N
- PDF version used
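
Once a page's size is known, the geometric rules in this list reduce to plain arithmetic. Below is a minimal sketch, assuming the page width and height are already available in PDF points (for example, read from the page's media box). The class PageRules, its method names, and the tolerance parameter are hypothetical and not part of PDFBox. Accepting both portrait and landscape A4 is also an assumption, since the rule does not state an orientation.

```java
/** Hypothetical helper, not part of PDFBox: pure arithmetic for the page rules. */
final class PageRules {
    // Default PDF user space has 72 units per inch; one inch is 25.4 mm.
    private static final double POINTS_PER_MM = 72.0 / 25.4;
    private static final double A4_SHORT_MM = 210.0;
    private static final double A4_LONG_MM = 297.0;

    private PageRules() {}

    static double mmToPoints(double mm) {
        return mm * POINTS_PER_MM;
    }

    /** True if the page is A4 within tolerancePt, in either orientation (assumption). */
    static boolean isA4(double widthPt, double heightPt, double tolerancePt) {
        double pageShort = Math.min(widthPt, heightPt);
        double pageLong = Math.max(widthPt, heightPt);
        return Math.abs(pageShort - mmToPoints(A4_SHORT_MM)) <= tolerancePt
            && Math.abs(pageLong - mmToPoints(A4_LONG_MM)) <= tolerancePt;
    }

    /** True if two axis-aligned rectangles (x, y, width, height) share any area. */
    static boolean overlaps(double ax, double ay, double aw, double ah,
                            double bx, double by, double bw, double bh) {
        return ax < bx + bw && bx < ax + aw && ay < by + bh && by < ay + ah;
    }

    /** The number of pages must be strictly less than maxPages (the N in the rule). */
    static boolean pageCountOk(int pageCount, int maxPages) {
        return pageCount < maxPages;
    }

    public static void main(String[] args) {
        // Reserved bar-code strip: 20 mm wide along the left edge (illustrative value).
        double stripWidth = mmToPoints(20.0);
        double pageHeight = mmToPoints(297.0);
        System.out.println("A4 page accepted: " + isA4(595.28, 841.89, 1.0));
        System.out.println("Content hits strip: "
            + overlaps(0, 0, stripWidth, pageHeight, 10, 100, 50, 50));
    }
}
```

The sketch deliberately leaves out how the bounding box of the page content is obtained, and it does not address memory use when loading the document; it only shows the per-page checks that would run once those values are in hand.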

So far we've been using PDDocument.load with a scratch file, but with huge
documents (e.g. product catalogues) we run into memory problems.
Is there a way to stream parse a PDF similar to stream parsing an XML
document (e.g. using StAX) and validate one page at a time?

Cheers

Stefan
