lucene-java-user mailing list archives

From Joshua O'Madadhain <jmad...@ics.uci.edu>
Subject Re: Wrong spelling
Date Wed, 24 Jul 2002 21:09:24 GMT
On Wed, 24 Jul 2002, Olivier Amira wrote:

> I would like to implement in my Lucene application a feature like
> Google's "Did you mean": when the user enters a misspelled word, the
> search engine automatically proposes a similar, better word. I'm not
> sure which method is best for implementing this in a Lucene
> application (or whether it's even correct to try to do this with a
> Lucene index). Could anybody help me with this?

There are a couple of different approaches to this that I'm aware of.

(1) Find a list of commonly misspelled words, detect them in a query, and
prompt the user with the corresponding correctly spelled words.  Such
lists are pretty common.  Advantages: reasonably easy to implement,
computationally cheap, and most of the work (figuring out what words to
flag and what words to suggest in their place) is done statically.  
Disadvantages: it will catch 'speling' mistakes but not 'spellling'
mistakes (that is, it will only recognize errors that you tell it about).
This is entirely independent of the index unless you go to the trouble of
removing entries from this auxiliary data structure that correspond to
words that aren't in the index anyway.

(2) There's something in the Lucene API docs about a FuzzyQuery that
mentions Levenshtein distance (= string edit distance, I believe).  I
haven't looked into this myself, but I would guess that you should be able
to construct a FuzzyQuery that specifies a maximum string edit distance
between a specified search term and other terms in the index.  
Unfortunately, the API docs are just about that helpful; FuzzyTermEnum has
more information but doesn't tell you how to use FuzzyQuery.  On the other
hand at least you now know where to look in the source code.  :)
Advantages: more flexible, and it seems to be built in.  Disadvantages: the
docs aren't helpful, and it will probably slow your query down more than (1)
would.
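
To make approach (1) a bit more concrete, here is a minimal sketch of the
lookup.  The class name MisspellingTable and its methods are made up for
illustration (nothing here is part of Lucene), and a real table would be
loaded from one of those misspelling lists rather than filled in by hand.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper for approach (1); not a Lucene class.
final class MisspellingTable {
    // Maps a lower-cased misspelling to its suggested correction.
    private final Map<String, String> corrections = new HashMap<>();

    // Registers one known misspelling and the word to suggest in its place.
    void add(String misspelled, String correct) {
        corrections.put(misspelled.toLowerCase(Locale.ROOT), correct);
    }

    // Returns a suggestion only if the word is a misspelling we were told about.
    Optional<String> suggest(String queryWord) {
        return Optional.ofNullable(corrections.get(queryWord.toLowerCase(Locale.ROOT)));
    }
}
```

After add("speling", "spelling"), suggest("speling") yields "spelling", while
suggest("spellling") comes back empty, which is exactly the limitation
described under (1).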

You could also try to write your own string edit distance calculator/data
structure, but I don't have any quick answers as to how to do that.
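
For what it's worth, the calculator half of that is not much code.  Below is
a rough sketch of the standard dynamic-programming Levenshtein distance, plus
a naive scan over a word list.  The names EditDistance, levenshtein, and
within are my own, and this is not how FuzzyQuery does it internally; the
scan compares against every word, so treat it as a starting point only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not a Lucene class.
final class EditDistance {
    private EditDistance() {}

    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions, and substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j chars of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i chars of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns every vocabulary word within maxDistance edits of word,
    // in the order the vocabulary supplies them.
    static List<String> within(String word, Iterable<String> vocabulary, int maxDistance) {
        List<String> matches = new ArrayList<>();
        for (String candidate : vocabulary) {
            if (levenshtein(word, candidate) <= maxDistance) {
                matches.add(candidate);
            }
        }
        return matches;
    }
}
```

With this, both 'speling' and 'spellling' are one edit away from 'spelling',
so unlike (1) it catches errors nobody listed in advance.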

Good luck--

Joshua O'Madadhain 

 jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
  Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
 It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.




