lucene-solr-user mailing list archives

From Erick Erickson
Subject Re: OCR - Saving multi-term position
Date Wed, 02 Jul 2014 16:28:40 GMT
Problem here is that you wind up with a zillion unique terms in your
index, which may lead to performance issues, but you probably already
know that :).

I've seen situations where running it through a dictionary helps. That
is, does each term in the OCR match some dictionary? Problem here is
that it then de-values terms that don't happen to be in the
dictionary, names for instance.
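
A minimal sketch of that dictionary check, assuming a plain in-memory word list (the DICTIONARY set and the split_by_dictionary helper are hypothetical names used only for illustration, not part of Solr or Lucene). It also shows the drawback: a proper name lands in the same bucket as a genuine OCR error.

```python
def split_by_dictionary(tokens, dictionary):
    """Partition OCR tokens into dictionary hits and misses (case-insensitive)."""
    known, unknown = [], []
    for token in tokens:
        # Lowercase only for the lookup; keep the original token text.
        (known if token.lower() in dictionary else unknown).append(token)
    return known, unknown


# Hypothetical word list; a real dictionary would be far larger.
DICTIONARY = {"the", "quick", "brown", "fox"}

known, unknown = split_by_dictionary(
    ["The", "quiok", "brown", "Erickson"], DICTIONARY
)
# "quiok" is an OCR error and "Erickson" is a valid name,
# yet both end up in `unknown`.
```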

But to answer your question: No, there really isn't a pre-built
analysis chain that I know of that does this. The root issue is how to
assign "confidence". No clue for your specific domain.

So payloads seem quite reasonable here. Happens there's a recent
end-to-end example, see:


On Wed, Jul 2, 2014 at 7:58 AM, Michael Della Bitta wrote:
> I don't have first hand knowledge of how you implement that, but I bet a
> look at the WordDelimiterFilter would help you understand how to emit
> multiple terms with the same positions pretty easily.
> I've heard of this "bag of word variants" approach to indexing poor-quality
> OCR output before for findability reasons and I heard it works out OK.
> Michael Della Bitta
> Applications Developer
> o: +1 646 532 3062
> appinions inc.
> “The Science of Influence Marketing”
> 18 East 41st Street
> New York, NY 10017
> t: @appinions
> On Wed, Jul 2, 2014 at 10:19 AM, Manuel Le Normand wrote:
>> Hello,
>> Many of our indexed documents are scanned and OCR'ed documents.
>> Unfortunately we were not able to improve the OCR quality much (less than
>> 80% word accuracy) for various reasons, a fact which badly hurts the
>> retrieval quality.
>> As we use an open-source OCR, we are thinking of changing every scanned term
>> in the output into its main possible variations to get a higher level of
>> confidence.
>> Is there any analyser that supports this kind of need, or should I make up a
>> syntax and analyser of my own, i.e. the payload syntax?
>> The quick brown fox --> The|1 Tlne|1 quick|2 quiok|2 browm|3 brown|3 fox|4
>> Thanks,
>> Manuel
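
The term|position syntax in the quoted message could be turned into stacked tokens roughly as follows. This is a sketch in plain Python rather than a Lucene TokenFilter, and parse_variants is a hypothetical helper. It relies on one convention that Lucene does use for synonyms: a token with a position increment of 0 occupies the same position as the token before it. The sketch assumes positions are integers that never decrease from left to right.

```python
def parse_variants(text, delimiter="|"):
    """Turn 'term|position' tokens into (term, position_increment) pairs.

    Variants that share a position get an increment of 0, so they stack
    on the previous token instead of advancing the position.
    """
    pairs = []
    previous_position = 0
    for raw in text.split():
        # Split on the last delimiter so terms may contain the delimiter.
        term, _, position_text = raw.rpartition(delimiter)
        position = int(position_text)  # raises ValueError if the suffix is missing
        increment = position - previous_position
        if increment < 0:
            raise ValueError(f"position goes backwards at {raw!r}")
        pairs.append((term, increment))
        previous_position = position
    return pairs


pairs = parse_variants("The|1 Tlne|1 quick|2 quiok|2 browm|3 brown|3 fox|4")
# "Tlne", "quiok" and "brown" each get increment 0, so a phrase query
# for "quick brown" can match through any stacked variant.
```

In an actual analysis chain the same pairs would be emitted through the token stream's position increment attribute; a per-variant confidence, if one were available, could ride along as a payload.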
