pdfbox-users mailing list archives

From A...@swmc.com
Subject RE: Can PDFBox extract text from PDF Documents that have "text boxes" ?
Date Mon, 17 Jan 2011 16:50:31 GMT
Since you ran into this, there are surely more files out there with the 
same problem, so I'd like PDFBox to be able to handle them.  If you can 
get a small sample PDF that doesn't contain any confidential information 
(perhaps from your 3rd party provider?), we should be able to look into 
it.

I'm not really familiar with the text extraction code, but I could look 
into a runtime exception given a stack trace and a PDF that reproduces the 
issue.  Just open an issue on JIRA, if you haven't already, and post the 
stack trace; when you get a sample PDF, attach that as well.

---- 
Thanks,
Adam



From:
"Lupton, Chris B." <christopher.lupton@gd-ais.com>
To:
"users@pdfbox.apache.org" <users@pdfbox.apache.org>
Date:
01/17/2011 06:35
Subject:
RE: Can PDFBox extract text from PDF Documents that have "text boxes" ?



Thanks for the tip about checking the File -> Properties.

Apparently the software that is generating and/or editing these PDF 
Documents originates from  "www.activepdf.com"

I can use that information to at least try to follow up with that 3rd party 
provider and see if there are any options for getting alternate versions 
of those Documents that don't contain these annotations.

As a follow-up note:
The PDF version is listed as:  PDF Version 1.5 (Acrobat 6.x)



-----Original Message-----
From: Adam@swmc.com [mailto:Adam@swmc.com] 
Sent: Friday, January 14, 2011 12:44 PM
To: users@pdfbox.apache.org
Cc: users@pdfbox.apache.org
Subject: Re: Can PDFBox extract text from PDF Documents that have "text boxes" ?

I'm not very familiar with these text boxes or with text extraction, but it 
sounds like this might be a newer feature of the PDF specification that 
simply has not been implemented yet.  But that's just a guess.  If you can 
find out what software created those PDF files, it might give us some 
more information.  In Adobe Acrobat: File -> Properties; check the "PDF 
Producer" and "PDF Version".  If you can get the software that was used 
and create a test PDF that fails to extract text, we could look over the 
technical data and better help you figure out what's going on.
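
If Acrobat is not at hand, the version can also be read from the file 
itself, since a PDF file starts with a header comment such as "%PDF-1.5".  
Below is a minimal sketch that uses only the Java standard library, not 
PDFBox.  The class name, method names and the "sample.pdf" path are 
illustrative assumptions.  Note that a document may override the header 
version with a /Version entry in its catalog, so treat the result as a 
hint rather than a definitive answer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
class PdfHeaderVersion {
    // Matches the header comment, e.g. "%PDF-1.5", and captures "1.5".
    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d\\.\\d)");

    /** Returns the version found in the given bytes, or null if none is found. */
    static String parseVersion(byte[] head) {
        // ISO-8859-1 maps every byte to one char, so binary data cannot break decoding.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    /** Reads up to the first 1024 bytes of the file and looks for the header. */
    static String readVersion(Path pdf) throws IOException {
        try (InputStream in = Files.newInputStream(pdf)) {
            byte[] buf = new byte[1024];
            int total = 0;
            int n;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (total < buf.length
                    && (n = in.read(buf, total, buf.length - total)) > 0) {
                total += n;
            }
            return parseVersion(Arrays.copyOf(buf, total));
        }
    }

    public static void main(String[] args) throws IOException {
        // "sample.pdf" is a placeholder; pass a real path as the first argument.
        Path pdf = Paths.get(args.length > 0 ? args[0] : "sample.pdf");
        System.out.println("PDF header version: " + readVersion(pdf));
    }
}
```

This only reports the version; the "PDF Producer" value is stored in the 
document information dictionary and needs a real PDF parser to read.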

---- 
Thanks,
Adam





From:
"Lupton, Chris B." <christopher.lupton@gd-ais.com>
To:
"users@pdfbox.apache.org" <users@pdfbox.apache.org>
Date:
01/14/2011 09:19
Subject:
Can PDFBox extract text from PDF Documents that have "text boxes" ?



I have PDF Documents that have apparently been edited by some kind of PDF 
Writing Application.
When edits are made, people are adding "Text Boxes" to the Documents 
instead of just removing/editing the existing Text.
Each of the Edits has a colored boundary around it.
These "Text Boxes" are always placed in between original lines of Text.

If the Document were not locked, I could click and drag the boxes of Text 
around on the Screen.
When I mouse over them, right-click and select Properties, 
the window displayed is titled "Text Box Properties."

When I attempt to extract text from the PDF Document, 
I either get runtime exceptions from within PDFBox's API, 
or I get Text back, but NONE of the text from these "Text Boxes" is 
captured.


Does anyone have a working code sample that can successfully retrieve 
Text from something like this?

I would love to provide an example; unfortunately the PDFs contain 
proprietary information, so I am not allowed to do that.









- FHA 203b; 203k; HECM; VA; USDA; Conventional 
- Warehouse Lines; FHA-Authorized Originators 
- Lending and Servicing in over 45 States 
www.swmc.com   -  www.simplehecmcalculator.com 
Visit  www.swmc.com/resources   for helpful links on Training, Webinars, 
Lender Alerts and Submitting Conditions 

This email and any content within or attached hereto from Sun West 
Mortgage Company, Inc. is confidential and/or legally privileged. The 
information is intended only for the use of the individual or entity named 
on this email. If you are not the intended recipient, you are hereby 
notified that any disclosure, copying, distribution or taking any action 
in reliance on the contents of this email information is strictly 
prohibited, and that the documents should be returned to this office 
immediately by email. Receipt by anyone other than the intended recipient 
is not a waiver of any privilege. Please do not include your social 
security number, account number, or any other personal or financial 
information in the content of the email. Should you have any questions, 
please call (800) 453 7884. 


