lucene-solr-user mailing list archives

From Koji Sekiguchi <>
Subject Re: OCR - Saving multi-term position
Date Thu, 03 Jul 2014 01:28:42 GMT
Hi Manuel,

I think OCR error correction is a well-known NLP task.
I have thought in the past that it could be implemented using Lucene.

This is a brief idea:

1. You have a Lucene index. This existing index is built from
correct (i.e. error-free) documents in the same domain as the OCR documents.

2. Tokenize the OCR text into shingles (word n-grams) with ShingleFilter. From the shingles, you'll get:

the quiok
tlne quick
the quick

3. Search for those phrases in the existing index. I think an exact search
(PhraseQuery) or a FuzzyQuery would work. Among those phrases, "the quick"
should get the highest hit count.
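
The steps above can be sketched roughly as follows. This is a minimal illustration in plain Python, not Lucene code: a Counter of two-word shingles stands in for the existing index, and a lookup in it stands in for the hit count of an exact phrase search (fuzzy matching is not covered). The function names, the sample sentences, and the per-position variant lists are all assumptions made for the example.

```python
from collections import Counter
from itertools import product


def shingles(tokens, size=2):
    """Return all word n-grams (shingles) of the given size."""
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def build_reference_counts(clean_docs, size=2):
    """Count shingles in error-free text (a stand-in for the existing index)."""
    counts = Counter()
    for doc in clean_docs:
        counts.update(shingles(doc.lower().split(), size))
    return counts


def best_shingle(candidates_a, candidates_b, counts):
    """Pick the candidate pair with the highest hit count in the reference text."""
    pairs = list(product(candidates_a, candidates_b))
    return max(pairs, key=lambda pair: counts[pair])


# Hypothetical error-free documents from the same domain as the OCR text.
clean_docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick red fox",
]
counts = build_reference_counts(clean_docs)

# Hypothetical OCR variants for the first two word positions.
best = best_shingle(["the", "tlne"], ["quick", "quiok"], counts)
print(best, counts[best])
```

With real data, the same comparison would be run for each pair of adjacent positions, and ties or zero counts would need a fallback such as keeping the OCR engine's first choice.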


(2014/07/02 7:19), Manuel Le Normand wrote:
> Hello,
> Many of our indexed documents are scanned and OCR'ed documents.
> Unfortunately, for various reasons, we were not able to improve the OCR
> quality much (less than 80% word accuracy), which badly hurts the
> retrieval quality.
> As we use an open-source OCR, we are thinking of expanding every scanned term
> to its main possible variations to get a higher level of confidence.
> Is there any analyser that supports this kind of need, or should I make up a
> syntax and analyser of my own, i.e. the payload syntax?
> The quick brown fox --> The|1 Tlne|1 quick|2 quiok|2 browm|3 brown|3 fox|4
> Thanks,
> Manuel
