pdfbox-users mailing list archives

From "Lupton, Chris B." <christopher.lup...@gd-ais.com>
Subject Can PDFBox extract text from PDF Documents that have "text boxes" ?
Date Fri, 14 Jan 2011 16:58:55 GMT
I have PDF documents that have apparently been edited with some kind of PDF writing application.
When edits are made, people add "Text Boxes" to the documents instead of removing or editing
the existing text.
Each edit has a colored border around it.
These "Text Boxes" are always placed in between the original lines of text.

If the document were not locked, I could click and drag the boxes of text around on the screen.
When I mouse over one, right-click, and select Properties,
the window displayed is titled "Text Box Properties."

When I attempt to extract text from the PDF document,
I either get runtime exceptions from within PDFBox's API,
or I get text back, but NONE of the text from these "Text Boxes" is captured.

Does anyone have a working code sample that can successfully retrieve text from something
like this?

I would love to provide an example, but unfortunately the PDFs contain proprietary information,
so I am not allowed to do that.
