pdfbox-users mailing list archives

From A...@swmc.com
Subject Re: Text extraction results in strange characters
Date Thu, 23 Jun 2011 16:12:52 GMT
Dani,
The font being used is probably embedded with a custom encoding, so the 
character codes in the file map to glyph shapes rather than to real 
characters.  This works fine for viewing the document, but without a 
mapping back to characters (ASCII or Unicode), you're not going to get 
reasonable results when extracting text or copying and pasting.  If my 
theory is correct, you'll find that you are also unable to copy & paste 
using Adobe Reader.  The only way to get the text out of a file like this 
would be to convert each page to an image, and then use OCR (optical 
character recognition) to extract the text.  As you probably already know, 
OCR is not 100% accurate, but it'd be better than nothing.
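
If you want to decide automatically when to fall back to OCR, here is a 
rough sketch of one possible check.  It is not part of PDFBox: the class 
name ExtractionSanityCheck, its methods, and the idea of using a threshold 
are all my own assumptions.  It only counts characters that rarely occur 
in real text (control characters, private-use code points, U+FFFD), so it 
will miss files where the broken mapping happens to produce ordinary 
letters.

```java
// Hypothetical helper, not a PDFBox API: flags extracted text that is
// mostly made of code points unlikely to appear in real prose.
final class ExtractionSanityCheck {

    private ExtractionSanityCheck() {}

    /** Fraction of code points in the text that look suspicious (0.0 for empty input). */
    static double suspiciousRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            if (isSuspicious(cp)) {
                suspicious++;
            }
        }
        return (double) suspicious / total;
    }

    /** Control, private-use, unassigned, lone-surrogate and U+FFFD code points count as suspicious. */
    static boolean isSuspicious(int cp) {
        if (cp == '\n' || cp == '\r' || cp == '\t') {
            return false; // ordinary line breaks and tabs are expected
        }
        if (cp == 0xFFFD) {
            return true; // Unicode replacement character
        }
        int type = Character.getType(cp);
        return type == Character.CONTROL
                || type == Character.PRIVATE_USE
                || type == Character.UNASSIGNED
                || type == Character.SURROGATE;
    }

    /** True when the suspicious fraction exceeds the caller-chosen threshold. */
    static boolean looksGarbled(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }
}
```

You would pass it whatever string your text extraction returns, e.g. 
ExtractionSanityCheck.looksGarbled(extractedText, 0.3), and only send the 
page to OCR when it returns true.  The 0.3 threshold is a guess and would 
need tuning against your own documents.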

Developers,
I suggest we add this to the FAQ on the website.  I've seen it come up a 
few times, and it's a very interesting explanation.

---- 
Thanks,
Adam



From:    Daniel Sánchez González <dsanchezg@gmail.com>
To:      users@pdfbox.apache.org
Date:    06/23/2011 04:55
Subject: Text extraction results in strange characters



When I try to convert a PDF to text, the operation results in strange
characters. If I copy some text from the PDF file and paste it into a
text editor, I get the same result.

What is wrong?

Thanks in advance.

Dani




