pdfbox-dev mailing list archives

From britt fitch <britt.fi...@wiredinformatics.com>
Subject processPages bug?
Date Fri, 04 Dec 2015 19:08:32 GMT
Hi All, I am a committer for another Apache project (cTAKES) and have been using PDFBox in
my own application for a while now by extending PDFTextStripper and overriding processTextPosition.
I was in the process of updating to 2.0-RC2 (from 1.8) and came across a few items that seem
like they may be issues.
My apologies if this has already been discussed. I did a quick search through JIRA and nothing
was obvious.

1.
PDFTextStripper.processPages(...)
This accepts a PDPageTree as its parameter, but the first line of the method creates
a second PDPageTree by calling document.getPages().
Should it just use the passed-in pages parameter instead of working with 2 instances of PDPageTree?

2.
The first line in processPages uses a document object that is null unless you call getText()
first.
Is it the intended behavior that getText must be called before processPages can be called?

3.
processPage(…) doesn’t appear to do anything unless it’s called from processPages(…)
because currentPageNo is not set if you just call processPage(…) directly.
This method probably can’t be made private because it’s an override, but should it either
remove the check for currentPageNo or otherwise throw an exception / log a warning?
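
To make items 2 and 3 concrete, here is a minimal, self-contained sketch of the behavior as described above. It does not use PDFBox: SketchStripper, its fields, and the idea of representing pages as strings are all hypothetical stand-ins, and the guard in processPage is an assumption modeled on the currentPageNo check mentioned in item 3, not a copy of the library source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the reported behavior; NOT the PDFBox implementation.
class PageGuardSketch {

    static class SketchStripper {
        private List<String> document;        // stays null until getText() runs (item 2)
        private int currentPageNo = 0;        // only advanced inside processPages() (item 3)
        private final int startPage = 1;      // assumed default: first page is page 1
        private final int endPage = Integer.MAX_VALUE;
        final List<String> processed = new ArrayList<>();

        String getText(List<String> doc) {
            document = doc;
            currentPageNo = 0;
            processed.clear();
            processPages(doc);
            return String.join("\n", processed);
        }

        void processPages(List<String> pages) {
            // Reads the field rather than the parameter, as described in items 1 and 2:
            // this line throws NullPointerException if getText() was never called.
            int pageCount = document.size();
            for (int i = 0; i < pageCount; i++) {
                currentPageNo++;
                processPage(pages.get(i));
            }
        }

        void processPage(String page) {
            // With currentPageNo still 0, this guard is false and the page is skipped silently.
            if (currentPageNo >= startPage && currentPageNo <= endPage) {
                processed.add(page);
            }
        }
    }

    // Item 3: calling processPage directly processes nothing.
    static int directCallCount() {
        SketchStripper s = new SketchStripper();
        s.processPage("page one");
        return s.processed.size();
    }

    // Item 2: calling processPages before getText fails on the null document field.
    static boolean processPagesBeforeGetTextThrows() {
        try {
            new SketchStripper().processPages(Arrays.asList("a", "b"));
            return false;
        } catch (NullPointerException expected) {
            return true;
        }
    }

    // The path that works: getText sets the document and drives processPages.
    static String viaGetText() {
        return new SketchStripper().getText(Arrays.asList("a", "b"));
    }

    public static void main(String[] args) {
        System.out.println("pages processed by direct call: " + directCallCount());
        System.out.println("processPages before getText throws: " + processPagesBeforeGetTextThrows());
        System.out.println("text via getText: " + viaGetText().replace("\n", " | "));
    }
}
```

In this model, either option raised in item 3 (dropping the guard, or throwing when currentPageNo is still 0) would turn the silent no-op into something a caller can notice.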

Cheers,

Britt


Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
Britt.Fitch@wiredinformatics.com

