lucene-solr-user mailing list archives

From Charlie Hull <char...@flax.co.uk>
Subject Re: OCR - Saving multi-term position
Date Thu, 03 Jul 2014 07:43:09 GMT
On 02/07/2014 15:19, Manuel Le Normand wrote:
> Hello,
> Many of our indexed documents are scanned and OCR'ed documents.
> Unfortunately we were not able to improve much the OCR quality (less than
> 80% word accuracy) for various reasons, a fact which badly hurts the
> retrieval quality.
>
> As we use an open-source OCR, we think of changing every scanned term
> output to its main possible variations to get a higher level of confidence.
>
> Is there any analyser that supports this kind of need or should I make up a
> syntax and analyser of my own, i.e the payload syntax?
>
> The quick brown fox --> The|1 Tlne|1 quick|2 quiok|2 browm|3 brown|3 fox|4
>
> Thanks,
> Manuel
>
Hi Manuel,

We've done something like this for several of our media monitoring
clients. The OCR system they use (ABBYY FineReader, I think; it's pretty
much an industry standard) has well-known error statistics: we know the
top N things it gets wrong, e.g. scanning 'm' as two 'n's, so we can
implement a kind of fuzzy search without introducing too many extra terms.
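
To make the idea concrete, here is a minimal Python sketch of expanding a
search term into the forms the OCR is likely to have produced for it. It is
only an illustration, not our production code: the function name
ocr_variants, the max_errors parameter and every confusion pair other than
'm' -> 'nn' are assumptions made up for the example. In practice the table
would come from the OCR engine's measured error statistics.

```python
from itertools import combinations

# Hypothetical confusion table: (printed text, what the OCR tends to emit).
# Only the 'm' -> 'nn' pair comes from the message; the rest are assumed.
OCR_CONFUSIONS = [("m", "nn"), ("rn", "m"), ("l", "1")]


def ocr_variants(term, confusions=OCR_CONFUSIONS, max_errors=1):
    """Return the term plus the forms it may take after known OCR errors."""
    # Every (start, printed, misread) site where a confusion could apply.
    sites = [
        (i, printed, misread)
        for printed, misread in confusions
        for i in range(len(term))
        if term.startswith(printed, i)
    ]
    variants = {term}
    for n in range(1, max_errors + 1):
        for combo in combinations(sites, n):
            combo = sorted(combo)
            # Skip combinations whose substitution sites overlap.
            if any(a[0] + len(a[1]) > b[0] for a, b in zip(combo, combo[1:])):
                continue
            pieces, pos = [], 0
            for start, printed, misread in combo:
                pieces.append(term[pos:start])
                pieces.append(misread)
                pos = start + len(printed)
            pieces.append(term[pos:])
            variants.add("".join(pieces))
    return sorted(variants)


print(ocr_variants("summer"))
```

Capping max_errors and keeping the table to the top N confusions is what
keeps the number of extra terms small; the variants could then be OR'ed
together at query time, or emitted at the same position at index time.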

Our case isn't quite that simple, as we're doing a lot of reverse
searching ('which queries match this document'), but the approach is
certainly sound. The following talk from Lucene Revolution is about this
kind of thing: http://www.youtube.com/watch?v=rmRCsrJp2A8
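
As a rough illustration of what reverse searching means here, the toy
Python sketch below checks a set of stored queries against one incoming
OCR'ed document, letting each query term match in an OCR-damaged form. It
is not how our system is built: the names CONFUSIONS, variants and
matching_queries, the sample queries and the all-terms-must-match rule are
all assumptions for the example.

```python
# Hypothetical confusion pairs (printed, misread); only 'm' -> 'nn' comes
# from the message, and a real table would hold the top N known errors.
CONFUSIONS = [("m", "nn")]


def variants(term):
    """The term plus every form with one known OCR confusion applied."""
    found = {term}
    for printed, misread in CONFUSIONS:
        start = term.find(printed)
        while start != -1:
            found.add(term[:start] + misread + term[start + len(printed):])
            start = term.find(printed, start + 1)
    return found


def matching_queries(stored_queries, ocr_text):
    """Names of stored queries whose terms all occur in the document,
    allowing each term to appear in one of its OCR-damaged forms."""
    doc_tokens = set(ocr_text.lower().split())
    return sorted(
        name
        for name, terms in stored_queries.items()
        if all(variants(term) & doc_tokens for term in terms)
    )


queries = {"weather": ["summer", "storm"], "sport": ["football"]}
print(matching_queries(queries, "Sunnmer storm hits the coast"))
```

The point of the sketch is only the direction of matching: the document
arrives once and the stored queries are tested against it, with the OCR
variants applied on the query side.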

Cheers

Charlie

-- 
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
