lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <>
Subject Re: sounds like spellcheck
Date Wed, 09 Feb 2005 13:30:39 GMT
On Feb 9, 2005, at 7:23 AM, Aad Nales wrote:
> In my Clipper days I could build an index on English words using a 
> technique that was called soundex. Searching in that index resulted in 
> hits of words that sounded the same. From what i remember this 
> technique only worked for English. Has it ever been generalized?

I do not know how Soundex/Metaphone/Double Metaphone work with 
non-English languages, but these algorithms are in Jakarta Commons 
Codec.  I used the Metaphone algorithm as a custom analyzer example in 
Lucene in Action.  You'll see it in the source code distribution under 
src/lia/analysis/codec.  I did a couple of variations, one that adds 
the metaphoned version as a token in the same position and one that 
simply replaces it in the token stream.

I even envisioned this sounds-like feature being used for children.  I 
was mulling over this idea while having lunch with my son one day last 
spring (he was 5 at the time).  I asked him how to spell "cool cat" and 
he replied "c-o-l c-a-t".  I tried it out with the metaphone algorithm 
and it matches!


> What i am trying to solve is this. A customer is looking for a 
> solution to spelling mistakes made by children (upto 10) when typing 
> in queries. The site is Dutch. Common mistakes are 'sgool' when 
> searching for 'school'. The 'normal' spellcheckers and suggestors 
> typically generate a list where the 'sounds like' candidates' are too 
> far away from the result. So what I am thinking about doing is this:
> 1. create a parser that takes a word and creates a soundindex entry.
> 2. create list of 'correctly' spelled words either based on the index 
> of the website or on some kind of dictionary.
> 2a. perhaps create a n-gram index based on these words
> 3. accept a query, figure out that a spelling mistake has been made
> 3a find alternatives by parsing the query and searching the 'sound 
> like index' and then calculate and order  the results
> Steps 2 and 3 have been discussed at length in this forum and have 
> even made it to the sandbox. What I am left with is 1.
> My thinking is processing a series of replacement statements that go 
> like:
> --
> g sounds like ch if the immediate predecessor is an s.
> o sounds like oo if the immediate predecessor is a consonant
> --
> But before I takes this to the next step I am wondering if anybody has 
> created or thought up alternative solutions?
> Cheers,
> Aad
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message