lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Spencer <dave-lucene-u...@tropo.com>
Subject Re: combining open office spellchecker with Lucene
Date Thu, 09 Sep 2004 16:51:02 GMT
Andrzej Bialecki wrote:

> David Spencer wrote:
> 
>>
>> I can/should send the code out. The logic is that for any terms in a 
>> query that have zero matches, go thru all the terms(!) and calculate 
>> the Levenshtein string distance, and return the best matches. A more 
>> intelligent way of doing this is to instead look for terms that also 
>> match on the 1st "n" (prob 3) chars.
> 
> 
> ...or prepare in advance a fast lookup index - split all existing terms 
> to bi- or trigrams, create a separate lookup index, and then simply for 
> each term ask a phrase query (phrase = all n-grams from an input term), 
> with a slop > 0, to get similar existing terms. This should be fast, and 
> you could provide a "did you mean" function too...

Sounds interesting/fun but I'm not sure if I'm following exactly.

Let's talk thru the trigram index case.
Are you saying that for every trigram in every word there will be a 
mapping of trigram -> term?
Thus if "recursive" is in the (orig) index then we'd create entries like:

rec -> recursive
ecu -> ...
cur -> ...
urs -> ...
rsi -> ...
siv -> ...
ive -> ...

And so on for all terms in the orig index.
OK fine.
But now the user types in a query like "recursivz".
What's the algorithm - obviously I guess take all trigrams in the bad 
term and go thru the trigram-index, but there will be lots of 
suggestions. Now what - use string distance to score them? I guess that 
makes sense - plz confirm if I understand.... And so I guess the point 
here is we precalculate the trigram->term mappings to avoid an expensive 
traversal of all terms in an index, but we still use string distance as 
a 2nd pass (and prob should force the matches to always match on the 1st 
   n (3) chars using the heuristic that people can usually start the 
spelling a word  corrrectly).


> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message