lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alexey Lef <ale...@sciquest.com>
Subject RE: n-gram indexing for generating spell suggestions
Date Mon, 18 Oct 2004 16:15:52 GMT
You can also store a phonetic key for the term to find sounds-like matches.
I use double metaphone algorithm which appears to be English specific. Not
sure if there is something out there for Dutch.

For the length, I use relative distance cutoff (distance/length) in addition
to the absolute length cutoff that doesn't work very well for short words
(as you mentioned).

Alexey

-----Original Message-----
From: Aad Nales [mailto:aad.nales@rotterdam-cs.com] 
Sent: Monday, October 18, 2004 11:59 AM
To: lucene-user@jakarta.apache.org
Subject: n-gram indexing for generating spell suggestions

...

2. often used misspelings in Dutch words between 4 and 5 characters were
missed. E.g. 'fiets' was suggested as a possible spell suggestion for
'feits' since no matching 3gram exist between the two. The same held
true for misspellings based on 'ch' and 'g' both being the same sound in
Dutch but written differently.

3. words that could never be part of a suggestion were added based on a
single matchting n-gram. (e.g. if I ask for suggestions on 'per' then
tupperware is also suggested. But solely based on length it is clear
that it has a minimal distance of 7. 

...

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message