pdfbox-users mailing list archives

From Tres Finocchiaro <tres.finocchi...@gmail.com>
Subject Re: Discrepancy between rendered and extracted characters.
Date Sat, 19 Apr 2014 20:00:51 GMT
In my experience these issues with OCR are unavoidable.  If certain
characters are frequently misread, you could fix them with a regex
string replacement. Unfortunately, proper context-sensitive spelling and
grammar correction in combination with OCR would require investment at
the scanning end, or a third-party plugin which corrects the text after
the fact.

What I've seen more of is words like MEMO being extracted as "M  EM  0".
Rolling your own if/else helpers for the common occurrences is probably
the fastest way from the receiving end, unless someone knows of a better
way... :)
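
A rough sketch of what such a helper might look like, assuming the text
has already been extracted into a String. The class name OcrCleanup and
both rules are made up for illustration; the real rules would depend on
which misreads actually show up in your documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: applies a fixed list of regex corrections to extracted text.
public class OcrCleanup {

    // One correction: a compiled pattern and its literal replacement.
    private static final class Rule {
        final Pattern pattern;
        final String replacement;

        Rule(String regex, String replacement) {
            this.pattern = Pattern.compile(regex);
            this.replacement = replacement;
        }
    }

    // Illustrative rules only (assumptions, not taken from any library).
    private static final List<Rule> RULES = new ArrayList<>();
    static {
        // "M  EM  0" -> "MEMO": allow stray whitespace between the letters
        // and a zero where the letter O should be.
        RULES.add(new Rule("\\bM\\s*E\\s*M\\s*[O0]\\b", "MEMO"));
        // A zero sandwiched between capital letters is assumed to be an O.
        // This would also rewrite codes like "A0B", so it may be too aggressive.
        RULES.add(new Rule("(?<=[A-Z])0(?=[A-Z])", "O"));
    }

    // Applies every rule in order and returns the corrected text.
    public static String clean(String text) {
        String result = text;
        for (Rule rule : RULES) {
            result = rule.pattern.matcher(result).replaceAll(rule.replacement);
        }
        return result;
    }
}
```

Calling OcrCleanup.clean(text) on the extracted string would apply the
rules in the order they were added. Keeping them in a list rather than a
chain of if/else makes it easier to add a rule each time a new misread
turns up.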

