From: David Spencer <dave-lucene-u...@tropo.com>
Subject: Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Date: Tue, 14 Sep 2004 21:18:57 GMT
Andrzej Bialecki wrote:

> David Spencer wrote:
> 
>>> ...or prepare in advance a fast lookup index - split all existing
>>> terms to bi- or trigrams, create a separate lookup index, and then
>>> simply for each term ask a phrase query (phrase = all n-grams from
>>> an input term), with a slop > 0, to get similar existing terms.
>>> This should be fast, and you could provide a "did you mean"
>>> function too...
>>>
>>
>> Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2
>> phases. First you build a "fast lookup index" as mentioned above.
>> Then to correct a word you do a query in this index based on the
>> ngrams in the misspelled word.
>>
> 
> The background for this suggestion was that I was playing some time ago 
> with a Luke plugin that builds various sorts of ancillary indexes, but 
> then I never finished it... Kudos for actually making it work ;-)

Sure, it was a fun little edge project. For the most part the code was 
done last week right after this thread appeared, but it always takes a 
while to get it from 95 to 100%.
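
For anyone following along, the n-gram split that both phases share is
trivial; a minimal sketch (the method name is illustrative, not
necessarily what's in the attached source):

import java.util.ArrayList;
import java.util.List;

// Split a word into its contiguous n-grams of a given length, e.g.
// formGrams("recursixe", 3) -> [rec, ecu, cur, urs, rsi, six, ixe].
static List formGrams(String word, int n) {
    List grams = new ArrayList();
    for (int i = 0; i + n <= word.length(); i++)
        grams.add(word.substring(i, i + n));
    return grams;
}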

> 
>> [1] Source is attached and I'd like to contribute it to the sandbox,
>> esp if someone can validate that what it's doing is reasonable and
>> useful.
> 
> 
> There have been many requests for this or similar functionality in the
> past; I believe it should go into the sandbox.
> 
> I was wondering about the way you build the n-gram queries. You 
> basically don't care about their position in the input term. Originally 
> I thought about using PhraseQuery with a slop - however, after checking 
> the source of PhraseQuery I realized that this probably wouldn't be that 
> fast... You use BooleanQuery and start/end boosts instead, which may
> give similar results in the end at much lower cost.
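
Right - roughly what the query construction looks like against the 1.4
BooleanQuery API (the boost values and field name here are illustrative,
not necessarily what's in the attached source):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// OR together the misspelled word's n-grams; boost the first and last
// gram on the theory that people usually get the edges of a word right.
static Query formQuery(String[] grams) {
    BooleanQuery query = new BooleanQuery();
    for (int i = 0; i < grams.length; i++) {
        TermQuery tq = new TermQuery(new Term("gram3", grams[i]));
        if (i == 0)
            tq.setBoost(2.0f);   // start boost
        else if (i == grams.length - 1)
            tq.setBoost(1.5f);   // end boost
        query.add(tq, false, false); // optional: not required, not prohibited
    }
    return query;
}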
> 
> I also wonder how this algorithm would behave for smaller values of
> start/end lengths (e.g. 2,3,4). In a sense, the smaller the n-gram
> length, the more "fuzziness" you introduce, which may or may not be
> desirable (increased recall at the cost of precision - for small indexes
> this may be useful from the user's perspective because you will always
> get a plausible hit, for huge indexes it's a loss).

Sure, I'll try to rebuild the demo w/ lengths 2-5 (and then the query
page can test any contiguous combo).
> 
>>
>> [2] Here's a demo page. I built an ngram index for ngrams of length 3
>>  and 4 based on the existing index I have of approx 100k 
>> javadoc-generated pages. You type in a misspelled word like
>> "recursixe" or whatnot to see what suggestions it returns. Note this
>> is not a normal search index query -- rather this is a test page for
>> spelling corrections.
>>
>> http://www.searchmorph.com/kat/spell.jsp
> 
> 
> Very nice demo! 

Thanks, kinda designed for ngram-nerds if you know what I mean :)

> I bet it's running way faster than the linear search over terms :-),
> even though you have to build the index in advance. But if you work
> with static or mostly static indexes this doesn't matter.

Indeed, this is almost zero time, whereas the simple and dumb linear
search was taking me 10 seconds. I will have to redo the site's main
search page so it uses this new code, TBD, prob tomorrow.
> 
>> Based on a subsequent mail in this thread I set boosts for the words
>> in the ngram index. The background is each word (er..term for a given
>>  field) in the orig index is a separate Document in the ngram index.
>> This Doc contains all ngrams (in my test case, like #2 above, of
>> length 3 and 4) of the word. I also set a boost of
>> log(word_freq)/log(num_docs) so that more frequent words will tend
>> to be suggested more often.
> 
> 
> You may want to experiment with 2 <= n <= 5. Some n-gram based
> techniques use all lengths together, some others use just a single
> length, and results also vary depending on the language...

Yep, will do, prob tomorrow.
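
For reference, the log(word_freq)/log(num_docs) boost above is applied
per Document at index-build time, roughly like this (1.4-era API; field
names are illustrative, and only trigrams are shown - the real index
also carries a gram4 field):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// One Document per term from the original index: the word itself plus
// its n-grams, boosted so frequent words win among similar candidates.
static void addWord(IndexWriter writer, String word, int wordFreq,
        int numDocs) throws IOException {
    Document doc = new Document();
    doc.add(Field.Keyword("word", word)); // the suggestion to return
    for (int i = 0; i + 3 <= word.length(); i++)
        doc.add(Field.Keyword("gram3", word.substring(i, i + 3)));
    doc.setBoost((float) (Math.log(wordFreq) / Math.log(numDocs)));
    writer.addDocument(doc);
}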
> 
>>
>> I think in "plain" English the way a word is suggested as a
>> spelling correction is:
>> - frequently occurring words score higher
>> - words that share more ngrams with the orig word score higher
>> - words that share rare ngrams with the orig word score higher
> 
> 
> I think this is a reasonable heuristic. Reading the code I would
> present it this way:

ok, thx, will update

> 
> - words that share more ngrams with the orig word score higher, and
>   words that share rare ngrams with the orig word score higher
>   (as a natural consequence of using BooleanQuery),
> 
> - and, frequently occurring words score higher (as a consequence of
>   using per-Document boosts),
> 
> - from reading the source code I see that you use Levenshtein distance
>   to prune the result set of too-long/too-short results.
> 
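Right - that last filter is a plain edit distance, essentially the
textbook dynamic program (a sketch, not the exact code from the
attachment):

// Classic DP edit distance; candidates too far from the input word
// get pruned from the suggestion list.
static int levenshtein(String a, String b) {
    int[][] d = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) d[i][0] = i;
    for (int j = 0; j <= b.length(); j++) d[0][j] = j;
    for (int i = 1; i <= a.length(); i++)
        for (int j = 1; j <= b.length(); j++) {
            int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
            d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,  // deletion
                                        d[i][j - 1] + 1), // insertion
                               d[i - 1][j - 1] + cost);   // substitution
        }
    return d[a.length()][b.length()];
}
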
> I think also that because you don't use the positional information
> about the input n-grams you may be getting some really weird hits.

Good point, though I haven't seen this yet. Might be due to the prefix 
boost and maybe some Markov chain magic tending to only show reasonable 
words.

> You could prune them by simply checking if you find a (threshold) of
> input ngrams in the right sequence in the found terms. This shouldn't
> be too costly because you operate on a small result set.

Good point, I'll try to add that in as an optional parameter.
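
Something like this, presumably (a sketch - the threshold would be the
optional parameter, say the fraction of grams that must match in order):

// Greedily check how many of the input word's n-grams occur in the
// candidate in left-to-right order; drop the candidate if the fraction
// falls below the threshold.
static boolean gramsInSequence(String[] grams, String candidate,
        double threshold) {
    int found = 0;
    int from = 0;
    for (int i = 0; i < grams.length; i++) {
        int at = candidate.indexOf(grams[i], from);
        if (at >= 0) {
            found++;
            from = at + 1; // next gram must start after this one
        }
    }
    return found >= threshold * grams.length;
}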
> 
>>
>> [6]
>>
>> If people want to vote me in as a committer to the sandbox then I can
> 
> 
> Well, someone needs to maintain the code after all... ;-)
> 

