pdfbox-users mailing list archives

From A...@swmc.com
Subject Re: Text extraction results in strange characters
Date Thu, 23 Jun 2011 17:26:13 GMT
I've never done any OCR stuff, so I have no idea.  However, I'd just like 
to mention that one problem that I foresee is that you won't know which 
way the text is facing.  For example, is it portrait or landscape?  Is it 
rotated at 90, 180, 270 degrees?  I'm not sure how to solve this.  One 
solution would be to do the OCR 4 times (once at each rotation 0, 90, 180, 
270) and just take the "best" result (which would probably mean the 
largest amount of text).  This would use 4 times the CPU time, but I'm not 
sure what your requirements are.  Maybe it's very important to go fast, or 
maybe you don't care how long it takes as long as the results are the best 
possible.
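
To make the four-rotation idea concrete, here is a minimal sketch in plain Java. It assumes the OCR engine can be wrapped as a function from an image to a string; the `ocr` parameter and the class and method names are hypothetical placeholders, not part of PDFBox or of any particular OCR library. "Best" is simply taken to mean the longest non-blank result, which is only a rough heuristic.

```java
import java.awt.image.BufferedImage;
import java.util.function.Function;

class RotationOcr {

    /** Rotates an image clockwise by 90 degrees, pixel by pixel. */
    static BufferedImage rotate90(BufferedImage src) {
        int w = src.getWidth();
        int h = src.getHeight();
        // Width and height swap after a quarter turn.
        BufferedImage dst = new BufferedImage(h, w, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                // Source pixel (x, y) lands at (h - 1 - y, x).
                dst.setRGB(h - 1 - y, x, src.getRGB(x, y));
            }
        }
        return dst;
    }

    /**
     * Runs the supplied OCR function at 0, 90, 180 and 270 degrees and
     * returns the longest result. The ocr argument is a stand-in for
     * whatever OCR library is actually used (an assumption of this sketch).
     */
    static String bestOfFourRotations(BufferedImage page,
                                      Function<BufferedImage, String> ocr) {
        String best = "";
        BufferedImage current = page;
        for (int turn = 0; turn < 4; turn++) {
            String text = ocr.apply(current);
            if (text != null && text.trim().length() > best.trim().length()) {
                best = text;
            }
            if (turn < 3) {
                current = rotate90(current);
            }
        }
        return best;
    }
}
```

Text length is a crude score: an OCR engine fed a sideways page may still emit plenty of garbage characters, so a dictionary check or the engine's own confidence value, if it exposes one, would likely be a better tie-breaker.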

I've also heard that reducing the colors can help.  For example, instead 
of having greyscale, convert it to use 1-bit pixels (either black or 
white).  This will make sure all the edges are sharp and most OCR 
algorithms will work better that way.  Of course, this could backfire 
severely if the text is a light shade of grey (as the entire image would 
be converted to white), if the text is in a light color (yellow, light 
blue, etc.), or if the background is a dark color (green text on a black 
background, for instance).  Again here you could do analysis on the image 
to try to detect the right filters to run on the image (invert colors so 
you have dark text on a light background, color saturation, contrast, 
etc.) and you could run the same image through OCR with multiple different 
filters and take the best result.  It's just a matter of how creative you 
want to get, how much CPU power you have to work with, how much 
development time you have, and how important it is that the results are as 
close to perfect as possible.
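
As a rough illustration of the 1-bit conversion and the "invert colors" filter, the sketch below uses only standard Java classes. Two choices here are my own assumptions rather than anything an OCR library prescribes: the threshold is the mean brightness of the page instead of a fixed 50% grey (so light grey text on white still comes out black), and a page whose mean brightness is dark is assumed to have light text on a dark background and is inverted. The class and method names are made up for the example.

```java
import java.awt.image.BufferedImage;

class Binarizer {

    /** Approximate perceived brightness of a packed RGB pixel, 0..255. */
    static int luminance(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (299 * r + 587 * g + 114 * b) / 1000;
    }

    /**
     * Converts an image to 1-bit black and white, aiming for black text
     * on a white background regardless of the original colors.
     */
    static BufferedImage toBlackAndWhite(BufferedImage src) {
        int w = src.getWidth();
        int h = src.getHeight();

        // First pass: mean brightness of the whole page.
        long sum = 0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                sum += luminance(src.getRGB(x, y));
            }
        }
        int mean = (int) (sum / ((long) w * h));

        // Assumption: the background covers most of the page, so a dark
        // mean suggests light text on a dark background.
        boolean invert = mean < 128;

        // Second pass: pixels on the "text" side of the mean become black.
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int lum = luminance(src.getRGB(x, y));
                boolean isText = invert ? lum > mean : lum < mean;
                dst.setRGB(x, y, isText ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return dst;
    }
}
```

A single global threshold like this will still struggle with uneven lighting or multi-colored pages, which is where running several filter variants through OCR and comparing the results comes in.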

But like I said, I've never actually done any OCR myself, so maybe the OCR 
libraries out there already take some/most/all of this into account. There 
might be someone else on this list who has experience and can provide some 
advice.  If not, check with the developers of OCR libs; I'm sure they'll 
have many good suggestions :-)

---- 
Thanks,
Adam





From:
Daniel Sánchez González <dsanchezg@gmail.com>
To:
<users@pdfbox.apache.org>
Date:
06/23/2011 09:47
Subject:
Re: Text extraction results in strange characters



Thank you very much for your explanation. I'll try to convert the PDF to an
image and then to text via OCR. Which is the most accurate way to do this?




----- Original Message ----- 
From: <Adam@swmc.com>
To: <users@pdfbox.apache.org>
Cc: <users@pdfbox.apache.org>
Sent: Thursday, June 23, 2011 6:12 PM
Subject: Re: Text extraction results in strange characters


Dani,
The type of font being used is probably embedded and mapped to images of
the characters.  This works great for viewing the document, but if you
don't have characters (ASCII or Unicode), you're not going to get
reasonable results when copying and pasting.  If my theory is correct,
you'll find that you will also be unable to copy & paste using Adobe
Reader.  The only way to get the text out of a file like this would be to
convert it to an image, and then try to use OCR (optical character
recognition) to extract the text.  As you probably already know, OCR is
not 100% accurate, but it'd be better than nothing.

Developers,
I suggest we add this to the FAQ on the website.  I've seen it come up a
few times, and it's a very interesting explanation.

---- 
Thanks,
Adam



From:
Daniel Sánchez González <dsanchezg@gmail.com>
To:
users@pdfbox.apache.org
Date:
06/23/2011 04:55
Subject:
Text extraction results in strange characters



When I try to convert a PDF to text, the operation results in strange
characters. If I copy some text from the PDF file and paste it into a text
editor, I get the same result.

What is wrong?

Thanks in advance.

Dani





- FHA 203b; 203k; HECM; VA; USDA; Conventional
- Warehouse Lines; FHA-Authorized Originators
- Lending and Servicing in over 45 States
www.swmc.com   -  www.simplehecmcalculator.com   Visit 
www.swmc.com/resources   for helpful links on Training, Webinars, Lender 
Alerts and Submitting Conditions
This email and any content within or attached hereto from Sun West 
Mortgage 
Company, Inc. is confidential and/or legally privileged. The information 
is 
intended only for the use of the individual or entity named on this email. 

If you are not the intended recipient, you are hereby notified that any 
disclosure, copying, distribution or taking any action in reliance on the 
contents of this email information is strictly prohibited, and that the 
documents should be returned to this office immediately by email. Receipt 
by 
anyone other than the intended recipient is not a waiver of any privilege. 

Please do not include your social security number, account number, or any 
other personal or financial information in the content of the email. 
Should 
you have any questions, please call (800) 453 7884. 



