lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aad Nales <>
Subject sounds like spellcheck
Date Wed, 09 Feb 2005 12:23:57 GMT
In my Clipper days I could build an index on English words using a 
technique that was called soundex. Searching in that index resulted in 
hits of words that sounded the same. From what i remember this technique 
only worked for English. Has it ever been generalized?

What i am trying to solve is this. A customer is looking for a solution 
to spelling mistakes made by children (upto 10) when typing in queries. 
The site is Dutch. Common mistakes are 'sgool' when searching for 
'school'. The 'normal' spellcheckers and suggestors typically generate a 
list where the 'sounds like' candidates' are too far away from the 
result. So what I am thinking about doing is this:

1. create a parser that takes a word and creates a soundindex entry.

2. create list of 'correctly' spelled words either based on the index of 
the website or on some kind of dictionary.
2a. perhaps create a n-gram index based on these words

3. accept a query, figure out that a spelling mistake has been made
3a find alternatives by parsing the query and searching the 'sound like 
index' and then calculate and order  the results

Steps 2 and 3 have been discussed at length in this forum and have even 
made it to the sandbox. What I am left with is 1.

My thinking is processing a series of replacement statements that go like:
g sounds like ch if the immediate predecessor is an s.
o sounds like oo if the immediate predecessor is a consonant

But before I takes this to the next step I am wondering if anybody has 
created or thought up alternative solutions?


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message