pdfbox-users mailing list archives

From A...@swmc.com
Subject Re: Can PDFBox extract text from PDF Documents that have "text boxes" ?
Date Fri, 14 Jan 2011 17:43:47 GMT
I'm not very familiar with these text boxes or with text extraction, but it 
sounds like they might be a newer feature of the PDF specification which 
simply has not been implemented yet.  But that's just a guess.  If you can 
find out what software created those PDF files, it might give us some 
more information.  In Adobe Acrobat: File -> Properties; check the "PDF 
Producer" and "PDF Version".  If you can get the software which was used 
and create a test PDF which fails to extract text, we could look over the 
technical data and better help you figure out what's going on.
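
If Acrobat is not handy, the same two properties can often be read straight 
from the file. The following is only a rough sketch, not PDFBox code: 
PdfProbe and its methods are hypothetical names, and it assumes the 
document information dictionary is stored uncompressed, unencrypted, and 
as a plain literal string (many files will not satisfy that, in which case 
it reports null). It also counts "/Subtype /FreeText" entries, on the 
assumption that the "text boxes" are FreeText annotations; that is a guess 
as well, and a count of zero proves nothing if the objects are compressed.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: a crude byte-level probe of a PDF.
final class PdfProbe {
    // The file header looks like "%PDF-1.6".
    private static final Pattern VERSION = Pattern.compile("%PDF-(\\d+\\.\\d+)");
    // Literal-string Producer entry; nested parentheses are not handled.
    private static final Pattern PRODUCER =
            Pattern.compile("/Producer\\s*\\(((?:\\\\.|[^\\\\)])*)\\)");
    // Assumption: Acrobat-style text boxes are annotations of subtype FreeText.
    private static final Pattern FREE_TEXT =
            Pattern.compile("/Subtype\\s*/FreeText\\b");

    // ISO-8859-1 maps each byte to one char, so binary content is not mangled.
    private static String latin1(byte[] data, int limit) {
        int length = Math.min(data.length, limit);
        return new String(data, 0, length, StandardCharsets.ISO_8859_1);
    }

    // Returns the header version such as "1.6", or null if no header is found.
    static String pdfVersion(byte[] data) {
        Matcher m = VERSION.matcher(latin1(data, 1024));
        return m.find() ? m.group(1) : null;
    }

    // Returns the first uncompressed Producer string, or null if none is visible.
    static String producer(byte[] data) {
        Matcher m = PRODUCER.matcher(latin1(data, data.length));
        return m.find() ? m.group(1) : null;
    }

    // Counts visible "/Subtype /FreeText" entries in the raw bytes.
    static int countFreeText(byte[] data) {
        Matcher m = FREE_TEXT.matcher(latin1(data, data.length));
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("PDF Version:        " + pdfVersion(data));
        System.out.println("PDF Producer:       " + producer(data));
        System.out.println("FreeText entries:   " + countFreeText(data));
    }
}
```

Run it with the path to one of the problem files as the only argument. 
Those three values can be shared on the list even when the document 
itself cannot.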

---- 
Thanks,
Adam





From: "Lupton, Chris B." <christopher.lupton@gd-ais.com>
To: "users@pdfbox.apache.org" <users@pdfbox.apache.org>
Date: 01/14/2011 09:19
Subject: Can PDFBox extract text from PDF Documents that have "text boxes" ?



I have PDF documents that have apparently been edited with some kind of PDF 
writing application.
When edits are made, people add "Text Boxes" to the documents 
instead of just removing or editing the existing text.
Each of the edits has a colored boundary around it.
These "Text Boxes" are always placed in between original lines of text.

If the document were not locked, I could click and drag the boxes of text 
around on the screen.
When I mouse over them, right-click, and select Properties, 
the window displayed is titled "Text Box Properties."

When I attempt to extract text from the PDF document, 
I either get runtime exceptions from within PDFBox's API, 
or I get text back but NONE of the text from these "Text Boxes" is 
captured.


Does anyone have a working code sample that can successfully retrieve 
text from something like this?

I would love to provide an example, but unfortunately the PDFs contain 
proprietary information, so I am not allowed to do that.








