pdfbox-users mailing list archives

From John Hewson <j...@jahewson.com>
Subject Re: Multi-threaded PDF parsing
Date Tue, 02 Dec 2014 19:33:05 GMT
Hi Juan,

> Hello,
> 
> From the FAQ about PDFBox being thread safe, it says one can have multiple
> threads each accessing their own PDDocument object.

Yes, that’s right.

> I have a question about this. Here's some pseudo Scala code I use to load a
> document, find its children, then parse those children using a
> PDFTextStripperByArea:
> 
> <code>
> val pdf = PDDocument.load(new File("/path/to/pdf"), true)
> val cos = pdf.getDocumentCatalog().getPages().getCOSObject()
> val kids = cos.getDictionaryObject(COSName.KIDS).asInstanceOf[COSArray]
> 
> // some logic to iterate through the children by index i:
> val kid = kids.getObject(i).asInstanceOf[COSDictionary]
> // XXX
> if (COSName.PAGE.equals(kid.getDictionaryObject(COSName.TYPE))) {
>  // definitely a kid. Process!
>  // ...  set up the stripper, set up bounding boxes, etc ...
>  stripper.extractRegions(new PDPage(kid))
> }
> </code>
> 
> The thing is, at point XXX I want to send the 'kid' off to another thread
> for asynchronous processing.

You can’t do this with PDFBox. Each PDDocument, and any objects obtained
from it (its kids, etc.), must be processed in the same thread.

> How would I go about getting a handle on the underlying file as loaded by
> PDDocument?
> Do I just instantiate a PDDocument in each async thread and get the i-th
> kid then process it?

You could create a PDDocument for the same file, in each thread, which
would cause the file to be opened and parsed independently for each ‘kid’.
This isn’t likely to result in good performance, as you’re re-parsing the PDF
over and over and also bypassing PDDocument’s caching mechanisms.

If you really want to do this, you could use a thread pool where each thread
holds an open PDDocument to the same PDF file and is given a list of page
numbers to process before closing the document. So it’s possible, but it’s
not simple.
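
A rough sketch of that thread-pool idea, in Scala to match your snippet. It is
only an illustration under assumptions: `PerThreadDocuments`, `processPages`,
`open` and `processPage` are hypothetical names, not part of PDFBox. With
PDFBox, `open` would wrap the PDDocument.load call from your code and
`processPage` would fetch the i-th page from that task's own document and run
the stripper on it.

```scala
import java.util.concurrent.{Callable, Executors}

object PerThreadDocuments {
  /** Splits the page indices into contiguous batches. Each task opens its
    * own document handle, processes its batch, and closes the handle.
    * `open` and `processPage` are hypothetical hooks supplied by the caller. */
  def processPages[D <: AutoCloseable, R](
      pageCount: Int,
      workers: Int,
      open: () => D,
      processPage: (D, Int) => R
  ): Vector[R] = {
    require(workers > 0, "workers must be positive")
    val pool = Executors.newFixedThreadPool(workers)
    try {
      val batchSize = math.max(1, math.ceil(pageCount.toDouble / workers).toInt)
      val batches = (0 until pageCount).grouped(batchSize).toVector
      val futures = batches.map { batch =>
        pool.submit(new Callable[Vector[R]] {
          override def call(): Vector[R] = {
            val doc = open() // one handle per task, never shared across threads
            try batch.map(i => processPage(doc, i)).toVector
            finally doc.close() // always release the underlying file
          }
        })
      }
      futures.flatMap(_.get()) // results come back in page order
    } finally pool.shutdown()
  }
}
```

Every task still re-parses the file when it opens its handle, so the cost
described above remains; the sketch only keeps each handle confined to one
thread and makes sure it is closed.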

> I've tried this, but get "Too many open files".

Maybe you forgot to close the PDDocument or another file stream.
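
One way to make a missed close less likely is a small loan-pattern helper. This
is a generic sketch, not a PDFBox API: `Loan` and `withResource` are
hypothetical names, and it assumes the resource implements AutoCloseable.

```scala
object Loan {
  /** Opens a resource, passes it to `use`, and closes it afterwards,
    * even if `use` throws. */
  def withResource[A <: AutoCloseable, B](open: => A)(use: A => B): B = {
    val resource = open
    try use(resource)
    finally resource.close()
  }
}
```

Wrapping each load in something like this means the file handle is released as
soon as the block finishes, whether or not extraction fails.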

> Any good examples of how this can be done?

My advice is not to do this. If you want parallel processing, then you’re
much better off doing it on a per-document basis rather than a per-page
basis.
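
A minimal sketch of per-document parallelism, again in Scala and under
assumptions: `PerDocumentParallel`, `processAll` and `extract` are hypothetical
names. `extract` stands in for whatever loads one PDF, pulls out the text and
closes the document, all inside a single task.

```scala
import java.util.concurrent.{Callable, Executors}

object PerDocumentParallel {
  /** Runs `extract` once per path on a fixed-size pool. Each document is
    * opened, processed and closed inside one task, so it never crosses
    * threads. Results are returned in the order of `paths`. */
  def processAll[R](paths: Seq[String], threads: Int)(extract: String => R): Vector[R] = {
    require(threads > 0, "threads must be positive")
    val pool = Executors.newFixedThreadPool(threads)
    try {
      val futures = paths.toVector.map { path =>
        pool.submit(new Callable[R] {
          override def call(): R = extract(path)
        })
      }
      futures.map(_.get()) // blocks until every document is done
    } finally pool.shutdown()
  }
}
```

Here each file is parsed exactly once, and the number of files open at any
moment is bounded by the pool size.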

> Thank you,
> Juan

— John

