lucene-solr-user mailing list archives

From Michael Della Bitta
Subject Re: OCR - Saving multi-term position
Date Wed, 02 Jul 2014 14:58:43 GMT
I don't have first-hand knowledge of how to implement that, but I bet a
look at the WordDelimiterFilter would help you understand how to emit
multiple terms at the same position pretty easily.
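
To make the "same position" idea concrete, here is a minimal sketch in plain
Python. It is not Lucene code and not necessarily how WordDelimiterFilter does
it. It assumes the `term|position` syntax from the quoted question and a
hypothetical helper name, `parse_variants`. It converts each annotated term
into a (term, position increment) pair, where an increment of 0 means "stack
this term on the previous one". In Lucene, a custom TokenFilter would
presumably express the same thing through PositionIncrementAttribute.

```python
from typing import List, Tuple


def parse_variants(text: str) -> List[Tuple[str, int]]:
    """Convert 'term|position' chunks into (term, position_increment) pairs.

    Assumption: positions are 1-based integers that never decrease from left
    to right, as in the example from the original question.
    """
    tokens: List[Tuple[str, int]] = []
    last_position = 0
    for chunk in text.split():
        # Split on the last '|' so a term may itself contain '|'.
        term, sep, pos = chunk.rpartition("|")
        if not sep or not term:
            raise ValueError(f"expected 'term|position', got {chunk!r}")
        position = int(pos)
        increment = position - last_position
        if increment < 0:
            raise ValueError(f"position goes backwards at {chunk!r}")
        # An increment of 0 stacks this variant on the previous term.
        tokens.append((term, increment))
        last_position = position
    return tokens


if __name__ == "__main__":
    line = "The|1 Tlne|1 quick|2 quiok|2 browm|3 brown|3 fox|4"
    for term, increment in parse_variants(line):
        print(term, increment)
```

With the example line, the first variant at each position gets an increment of
1 and each further variant gets 0, so a phrase query for "quick brown" could
also match text that the OCR read as "quiok browm".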

I've heard of this "bag of word variants" approach being used to index
poor-quality OCR output to improve findability, and I've heard it works out OK.

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017


On Wed, Jul 2, 2014 at 10:19 AM, Manuel Le Normand wrote:

> Hello,
> Many of our indexed documents are scanned and OCR'ed documents.
> Unfortunately, for various reasons we were not able to improve the OCR
> quality much (less than 80% word accuracy), which badly hurts retrieval
> quality.
> As we use an open-source OCR engine, we are thinking of expanding every
> scanned term into its most likely variants to get a higher level of
> confidence.
> Is there any analyser that supports this kind of need, or should I make up a
> syntax and analyser of my own, i.e. the payload syntax?
> The quick brown fox --> The|1 Tlne|1 quick|2 quiok|2 browm|3 brown|3 fox|4
> Thanks,
> Manuel
