lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Aad Nales" <aad.na...@rotterdam-cs.com>
Subject n-gram indexing for generating spell suggestions
Date Mon, 18 Oct 2004 15:58:59 GMT
Hi All,

After having used the suggested algoritms for a few weeks I found that
the suggestions were not completely to my liking;

1. small words (1,2 or 3) characters were never corrected. Especially
three letter words and abbreviations suffered.
2. often used misspelings in Dutch words between 4 and 5 characters were
missed. E.g. 'fiets' was suggested as a possible spell suggestion for
'feits' since no matching 3gram exist between the two. The same held
true for misspellings based on 'ch' and 'g' both being the same sound in
Dutch but written differently.
3. words that could never be part of a suggestion were added based on a
single matchting n-gram. (e.g. if I ask for suggestions on 'per' then
tupperware is also suggested. But solely based on length it is clear
that it has a minimal distance of 7. 

This is how i solved it.

I introduced two new fields to the ngram index

1. length (LEN)
2. firstletterlastcharacter (FLLC)


At index time I use the following rules:

for all words the length is stored;

for all words of 1,2,3 character i store the 1 grams
for all words of 4 character i store the 2 grams and the FLLC field
for all words of 5 and 6 character i store the 3 grams and the FLLC
field
for all words of 7 or more characters i store the 3 grams

When searching I limit the search to those fields that have a length
[goal.length - maxd, goal.length+d] and based on the possible outcomes I
apply the 1,2,3 and the FLLC field same as at index time.

The results are ordered first on distance and second on the difference
in length. So if two suggestions share the levenhstein distance with the
goal then the suggestion that has the lowest (goal.length -
suggestion.length) gets priority.

The result for me is a much more natural spellcheck result. I am not at
all sure if these results translate to English. But with this approach I
can guarantee that all spellingmistakes that involve changing two
characters and all spelling mistakes that involve ommitting or adding
one character are found (or at least that is what i think now ;-).

Just want to thank the people again that responsed to my original mail,
your great suggestions have greatly improved the user and technical
performance of my solution.


--
Aad Nales



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message