lucene-solr-user mailing list archives

From "Burton-West, Tom" <tburt...@umich.edu>
Subject Cleaning up dirty OCR
Date Tue, 09 Mar 2010 19:31:27 GMT
Hello all,

We have been indexing a large collection of OCR'd text: about 5 million books in over 200
languages. With 1.5 billion OCR'd pages, even a small OCR error rate creates a relatively
large number of meaningless unique terms. (See
http://www.hathitrust.org/blogs/large-scale-search/too-many-words)

We would like to remove some *fraction* of these nonsense words caused by OCR errors prior
to indexing. (We don't want to remove "real" words, so we need a method with very few
false positives.)

A dictionary-based approach does not seem feasible given the number of languages and the
inclusion of proper names, place names, and technical terms. We are considering using some
heuristics, such as looking for strings over a certain length or strings containing more
than some number of punctuation characters.
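
As a very rough illustration, this is the kind of check we have in mind (the class name,
method name, and thresholds below are just placeholders, not anything we have tuned):

// Rough sketch of the length and punctuation heuristics described above.
// GarbageTokenCheck, isGarbage, and both thresholds are placeholders.
public class GarbageTokenCheck {

    private static final int MAX_TOKEN_LENGTH = 25; // longest "plausible" word
    private static final int MAX_PUNCT_COUNT = 2;   // punctuation allowed inside a word

    // Returns true if the token looks like OCR garbage and should be dropped.
    public static boolean isGarbage(String token) {
        if (token.length() > MAX_TOKEN_LENGTH) {
            return true;
        }
        int punct = 0;
        for (int i = 0; i < token.length(); i++) {
            if (!Character.isLetterOrDigit(token.charAt(i))) {
                punct++;
            }
        }
        return punct > MAX_PUNCT_COUNT;
    }
}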

This paper describes a few such heuristics:
Kazem Taghva, Tom Nartker, Allen Condit, and Julie Borsack. "Automatic Removal of 'Garbage
Strings' in OCR Text: An Implementation." In The 5th World Multi-Conference on Systemics,
Cybernetics and Informatics, Orlando, Florida, July 2001.
http://www.isri.unlv.edu/publications/isripub/Taghva01b.pdf
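
If we went this route, we would presumably wrap a check like the one above in a custom
TokenFilter in our analysis chain. Something along these lines (untested, written against
the Lucene 2.9/3.0 TermAttribute API, and calling the placeholder isGarbage() sketched
above):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Drops tokens flagged by the isGarbage() heuristic sketched earlier.
public final class GarbageTokenFilter extends TokenFilter {

    private final TermAttribute termAtt = addAttribute(TermAttribute.class);

    public GarbageTokenFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        while (input.incrementToken()) {
            if (!GarbageTokenCheck.isGarbage(termAtt.term())) {
                return true; // keep this token
            }
            // otherwise drop it and advance to the next token
        }
        return false; // end of stream
    }
}

Whether heuristics this crude can remove a useful fraction of the garbage without too many
false positives is exactly what we are unsure about.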

Can anyone suggest practical solutions for removing some fraction of the tokens containing
OCR errors from our input stream?

Tom Burton-West
University of Michigan Library
www.hathitrust.org

