pdfbox-users mailing list archives

From Zeev Sands <zeev.sa...@gmail.com>
Subject Re: Discrepancy between rendered and extracted characters.
Date Sat, 19 Apr 2014 19:49:19 GMT
On 04/19/2014 03:28 PM, Tres Finocchiaro wrote:
> @ZS,
>
> Is the text part of the original PDF or has it been created with OCR?
>
> That sounds similar to an OCR issue where the scanner that scanned in the
> document made the mistake.
>
> -Tres
>

I obtained the document from a 3rd party, so I am not sure, but looking
at the "producer" field in its metadata I see 'Adobe Acrobat Pro
11.0.6 Paper Capture Plug-in'. So it appears you are correct: the
document might have been scanned. Ouch!

What are my options for extracting error-free text? Using better
OCR software? I have just started using pdfbox, so I haven't compiled
any statistics on the variety or frequency of these errors. How do
people deal with this issue? Is it possible to write a set of
substitution rules for the few characters that come out wrong?
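
To make the last question concrete, here is a rough sketch of what I mean
by a set of rules. It is plain Java post-processing applied to the text
after extraction, not anything from the PDFBox API. The class name
OcrFixups and every substitution in it are hypothetical placeholders; I
have not yet measured which confusions actually occur in my document.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class OcrFixups {
    // Hypothetical literal substitutions; replace with the confusions
    // actually observed in the extracted text.
    private static final Map<String, String> LITERAL = new LinkedHashMap<>();
    static {
        LITERAL.put("\uFB01", "fi"); // "fi" ligature code point
        LITERAL.put("\uFB02", "fl"); // "fl" ligature code point
    }

    // Hypothetical context rules: a digit sitting between letters is
    // assumed to be a misread letter. Real numbers are left untouched.
    private static final Pattern ONE_IN_WORD =
            Pattern.compile("(?<=[a-z])1(?=[a-z])");
    private static final Pattern ZERO_IN_WORD =
            Pattern.compile("(?<=[A-Za-z])0(?=[A-Za-z])");

    static String apply(String text) {
        String out = text;
        // Apply the literal replacements first, in insertion order.
        for (Map.Entry<String, String> e : LITERAL.entrySet()) {
            out = out.replace(e.getKey(), e.getValue());
        }
        // Then apply the context-sensitive rules.
        out = ONE_IN_WORD.matcher(out).replaceAll("l");
        out = ZERO_IN_WORD.matcher(out).replaceAll("o");
        return out;
    }

    public static void main(String[] args) {
        System.out.println(apply("he1lo w0rld 2014"));
    }
}
```

Rules like these can only repair confusions that are unambiguous in
context, so I assume they would complement rather than replace
re-running the scan through a better OCR engine.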

Thank you,
-ZS
