pdfbox-users mailing list archives

From Juan M Uys <opy...@gmail.com>
Subject Multi-threaded PDF parsing
Date Tue, 02 Dec 2014 18:41:01 GMT
Hello,

The FAQ entry on PDFBox thread safety says that multiple threads can each
access their own PDDocument object.

I have a question about this. Here's some pseudo-Scala code I use to load a
document, find its children, then parse those children using a
PDFTextStripperByArea:

<code>
val pdf = PDDocument.load(new File("/path/to/pdf"), true)
val cos = pdf.getDocumentCatalog().getPages().getCOSObject()
val kids = cos.getDictionaryObject(COSName.KIDS).asInstanceOf[COSArray]

// some logic to iterate through the children by index i:
val kid = kids.getObject(i).asInstanceOf[COSDictionary] // getObject returns a COSBase
// XXX
if (COSName.PAGE.equals(kid.getDictionaryObject(COSName.TYPE))) {
  // definitely a page (not an intermediate Pages node). Process!
  // ...  set up the stripper, set up bounding boxes, etc ...
  stripper.extractRegions(new PDPage(kid))
}
</code>

The thing is, at point XXX I want to send the 'kid' off to another thread
for asynchronous processing.

How would I get a handle on the underlying file that PDDocument loaded?
Or do I just instantiate a PDDocument in each async thread, get the i-th
kid, and process it there?
I've tried the latter, but I get a "Too many open files" error.
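
For reference, a minimal sketch of the one-document-per-task approach described
above. It is library-agnostic and uses only the standard library: the names
PerTaskDocuments, withDocument and processAll are hypothetical, the open, close
and extract functions stand in for the real PDFBox calls, and the pool size and
timeout are arbitrary. It assumes the "Too many open files" error comes from too
many documents being open at once (or never closed), which has not been
confirmed.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object PerTaskDocuments {
  // Hypothetical helper: open a document, use it, and always close it,
  // so that a failed task does not leak a file handle.
  def withDocument[R, A](open: () => R, close: R => Unit)(use: R => A): A = {
    val doc = open()
    try use(doc) finally close(doc)
  }

  // Runs one task per page index. Each task opens its own document, so no
  // document object is shared between threads. The fixed-size pool caps how
  // many documents can be open at the same time (assumption: maxOpen is
  // chosen well below the OS limit on open file descriptors).
  def processAll[R, A](pageCount: Int, maxOpen: Int)(
      open: () => R, close: R => Unit)(
      extract: (R, Int) => A): Seq[A] = {
    val pool = Executors.newFixedThreadPool(maxOpen)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      val tasks = (0 until pageCount).map { i =>
        Future(withDocument(open, close)(doc => extract(doc, i)))
      }
      // Results come back in page order; the timeout is an arbitrary choice.
      Await.result(Future.sequence(tasks), 10.minutes)
    } finally pool.shutdown()
  }
}
```

With PDFBox, open could be something like () => PDDocument.load(new File(path))
(assuming a version that provides PDDocument.load(File)), close would call
close() on the document, and extract would fetch the i-th page and run the
stripper on it. Reloading the whole file for every page is wasteful, so this
only illustrates the bounded, always-closed pattern.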

Any good examples of how this can be done?

Thank you,
Juan
