pdfbox-users mailing list archives

From Frank van der Hulst <drifter.fr...@gmail.com>
Subject Re: Multi-threaded PDF parsing
Date Tue, 02 Dec 2014 19:49:52 GMT
Hi Juan,
I have multiple threads reading PDF files, although I'm not trying to process
kids asynchronously. Here are some excerpts. I guess that to get at the
underlying PDDocument handle, you should extend PDFTextStripperByArea.

Frank

<code>
  public static Thread parsePDF() {
    return new Thread(() -> {
      List<String[]> result;
      try {
        result = new PDFTableStripper().parse(sourceFilename);
      } catch (IOException ex) {
        log.fatal("Failed to read " + sourceFilename);
        log.fatal(ex.getMessage(), ex);
        return;
      }

      // Process result here
      log.info("Finished");
    }, "Parse PDF");
  }

public class PDFTableStripper extends PDFTextStripper {
  private ArrayList<String[]> result;

  public PDFTableStripper() throws IOException {
    super();
    setLineSeparator("\n");
    setPageEnd("\f");
    result = new ArrayList<>(0);
  }

  /**
   * Convenience method to parse the specified PDF file.
   *
   * @param filepath Full path to file
   * @return List of String[]s containing one String[] entry for each row
   *         in the table. Each row is an array of Strings, with one entry
   *         for each column in the table.
   * @throws IOException
   */
  public List<String[]> parse(String filepath) throws IOException {
    document = PDDocument.load(filepath);
    try {
      getText(document);
    } finally {
      // Always release the file handle, even if text extraction fails.
      document.close();
    }
    return result;
  }
}
</code>
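
If each thread opens its own file, capping the number of worker threads also caps the number of open handles. Below is a minimal sketch of that idea using only the JDK, not PDFBox. `BoundedFileTasks`, `StreamWork` and `processAll` are hypothetical names made up for illustration, and a plain `InputStream` stands in for the document. With PDFBox, the task body would presumably load a PDDocument and close it in a `finally` block instead.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper for illustration; not part of PDFBox.
final class BoundedFileTasks {

  /** Hypothetical callback: the work to do while the file is open. */
  interface StreamWork<T> {
    T apply(InputStream in) throws IOException;
  }

  /**
   * Runs one task per file on a fixed-size pool. Each task opens its own
   * stream and closes it before finishing, so at most maxOpen files are
   * held open by this method at any time.
   */
  static <T> List<T> processAll(List<Path> files, int maxOpen, StreamWork<T> work)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(maxOpen);
    try {
      List<Future<T>> futures = new ArrayList<>();
      for (Path file : files) {
        futures.add(pool.submit(() -> {
          // try-with-resources closes the handle even if the work throws.
          try (InputStream in = Files.newInputStream(file)) {
            return work.apply(in);
          }
        }));
      }
      // Collect results in submission order; get() rethrows task failures.
      List<T> results = new ArrayList<>();
      for (Future<T> future : futures) {
        results.add(future.get());
      }
      return results;
    } finally {
      pool.shutdown();
    }
  }
}
```

The pool size is the only knob here: a pool of 4 threads means no more than 4 files are open at once, however many files are queued.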


On Wed, Dec 3, 2014 at 7:41 AM, Juan M Uys <opyate@gmail.com> wrote:

> Hello,
>
> From the FAQ about PDFBox being thread safe, it says one can have multiple
> threads each accessing their own PDDocument object.
>
> I have a question about this. Here's some pseudo Scala code I use to load a
> document, find its children, then parse those children using a
> PDFTextStripperByArea:
>
> <code>
> val pdf = PDDocument.load(new File("/path/to/pdf"), true)
> val cos = pdf.getDocumentCatalog().getPages().getCOSObject()
> val kids = cos.getDictionaryObject(COSName.KIDS).asInstanceOf[COSArray]
>
> // some logic to iterate through the children by index i:
> val kid = kids.getObject(i).asInstanceOf[COSDictionary]
> // XXX
> if (COSName.PAGE.equals(kid.getDictionaryObject(COSName.TYPE))) {
>   // definitely a kid. Process!
>   // ...  set up the stripper, set up bounding boxes, etc ...
>   stripper.extractRegions(new PDPage(kid))
> }
> </code>
>
> The thing is, at point XXX I want to send the 'kid' off to another thread
> for asynchronous processing.
>
> How would I go about getting a handle on the underlying file as loaded by
> PDDocument?
> Do I just instantiate a PDDocument in each async thread and get the i-th
> kid then process it?
> I've tried this, but get "Too many open files".
>
> Any good examples of how this can be done?
>
> Thank you,
> Juan
>
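
On the "Too many open files" error in the quoted message: one way to avoid opening the file once per kid is to give each worker a contiguous range of page indices, so the file is opened once per worker. The sketch below only computes those ranges and makes no PDFBox calls. `PageRanges`, `Range` and `split` are hypothetical names for illustration, and it assumes the page count is already known.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of PDFBox.
final class PageRanges {

  /** Half-open range [from, to) of zero-based page indices. */
  static final class Range {
    final int from;
    final int to;

    Range(int from, int to) {
      this.from = from;
      this.to = to;
    }
  }

  /**
   * Splits pageCount pages into at most `workers` contiguous ranges whose
   * sizes differ by no more than one page. Empty ranges are not returned.
   */
  static List<Range> split(int pageCount, int workers) {
    if (pageCount < 0 || workers < 1) {
      throw new IllegalArgumentException("pageCount >= 0 and workers >= 1 required");
    }
    List<Range> ranges = new ArrayList<>();
    int base = pageCount / workers;   // minimum pages per worker
    int extra = pageCount % workers;  // the first `extra` workers get one more
    int start = 0;
    for (int w = 0; w < workers && start < pageCount; w++) {
      int size = base + (w < extra ? 1 : 0);
      ranges.add(new Range(start, start + size));
      start += size;
    }
    return ranges;
  }
}
```

Each worker would then open its own document once, process the kids in its range, and close the document, which keeps the number of open handles equal to the number of workers.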
