pdfbox-users mailing list archives

From Daniel Sánchez González <dsanch...@gmail.com>
Subject Re: Text extraction results in strange characters
Date Thu, 23 Jun 2011 16:46:44 GMT
Thank you very much for your explanation. I'll try converting the PDF to an
image and then extracting the text via OCR. Which is the most accurate way to do this?

----- Original Message ----- 
From: <Adam@swmc.com>
To: <users@pdfbox.apache.org>
Cc: <users@pdfbox.apache.org>
Sent: Thursday, June 23, 2011 6:12 PM
Subject: Re: Text extraction results in strange characters

The font being used is probably embedded and mapped to images of the
characters.  This works great for viewing the document, but if there are
no real character codes (ASCII or Unicode) behind the glyphs, you're not
going to get reasonable results when extracting or copying and pasting.
If my theory is correct, you'll find that you are also unable to copy &
paste using Adobe Reader.  The only way to get the text out of a file like
this would be to convert it to an image, and then use OCR (optical
character recognition) to extract the text.  As you probably already know,
OCR is not 100% accurate, but it'd be better than nothing.
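
As a rough illustration of when to fall back to OCR, here is a minimal
sketch that checks whether already-extracted text looks garbled. The class
ExtractionCheck, its methods, and the 30% cutoff are all hypothetical and
are not part of PDFBox; it uses only the standard Java library and assumes
the extracted text is already available as a String. It only catches
private-use, unassigned, control, and replacement characters, so text that
comes out as wrong but ordinary letters would slip through.

```java
// Hypothetical helper, not part of PDFBox: estimates whether extracted
// text is usable or whether the page should be sent to OCR instead.
final class ExtractionCheck {

    // Assumed cutoff: above 30% suspicious characters we call the text garbled.
    static final double SUSPICIOUS_THRESHOLD = 0.30;

    /** Returns the fraction of non-whitespace code points unlikely in real text. */
    static double suspiciousRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // whitespace tells us nothing either way
            }
            total++;
            int type = Character.getType(cp);
            boolean bad = cp == 0xFFFD                  // Unicode replacement character
                    || type == Character.PRIVATE_USE    // codes with no standard meaning
                    || type == Character.UNASSIGNED     // not a defined character
                    || type == Character.CONTROL;       // non-whitespace control codes
            if (bad) {
                suspicious++;
            }
        }
        return total == 0 ? 0.0 : (double) suspicious / total;
    }

    /** True when the suspicious fraction exceeds the assumed threshold. */
    static boolean needsOcr(String text) {
        return suspiciousRatio(text) > SUSPICIOUS_THRESHOLD;
    }
}
```

One possible use: run the normal text extraction first, pass the result to
ExtractionCheck.needsOcr, and only render the page to an image for OCR when
it returns true. The threshold would need tuning against real documents.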

I suggest we add this to the FAQ on the website.  I've seen it come up a
few times, and it's a very interesting explanation.


Daniel Sánchez González <dsanchezg@gmail.com>
06/23/2011 04:55
Text extraction results in strange characters

When I try to convert a PDF to text, the operation results in strange
characters. If I copy some text from the PDF file and paste it into a text
editor, I get the same result.

What is wrong?

Thanks in advance.

