lucene-java-user mailing list archives

From "Eric Jain" <Eric.J...@isb-sib.ch>
Subject Re: misspelled queries
Date Fri, 27 Jun 2003 12:15:26 GMT
> I've been thinking about trying to implement a misspelled-query or
> similarity match, a la Google's "did you mean this ...".

This is what I do: If a query yields a low number of results, and one of
the terms does not occur in the index, or not very often, then the term
that occurs most often in the index among all terms that are similar to
the original term is suggested as a correction.
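A rough sketch of that rule, in case it helps. It is not the actual code: a plain Map stands in for indexReader.docFreq(term), plain Levenshtein distance stands in for whatever similarity measure you prefer, and the names SpellingSuggester, RARE_DOC_FREQ and MAX_EDITS are made up for illustration. The "query yields a low number of results" check is left to the caller.

```java
import java.util.HashMap;
import java.util.Map;

class SpellingSuggester {
    // Hypothetical thresholds, not taken from any real setup; tune for your index.
    static final int RARE_DOC_FREQ = 2; // at or below this, a term counts as "not very often"
    static final int MAX_EDITS = 2;     // at or below this, a term counts as "similar"

    /** Classic two-row dynamic-programming Levenshtein distance. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // prev now holds the row just computed
        }
        return prev[b.length()];
    }

    /**
     * Returns the most frequent term similar to the given one, or null if the
     * term is common enough already or no similar term occurs more often.
     * docFreq is a stand-in for asking the index how many documents contain a term.
     */
    static String suggest(String term, Map<String, Integer> docFreq) {
        int own = docFreq.getOrDefault(term, 0);
        if (own > RARE_DOC_FREQ) return null; // term looks fine, nothing to suggest
        String best = null;
        int bestFreq = own;
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            String candidate = e.getKey();
            if (candidate.equals(term)) continue;
            if (editDistance(term, candidate) <= MAX_EDITS && e.getValue() > bestFreq) {
                best = candidate;
                bestFreq = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("lucene", 120);
        docFreq.put("lucent", 7);
        docFreq.put("index", 300);
        System.out.println(suggest("lucine", docFreq)); // prints "lucene"
        System.out.println(suggest("index", docFreq));  // prints "null"
    }
}
```

Scanning every term like this is only for illustration; on a real index you would restrict the candidates first.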

Works pretty well most of the time, and when not, it's usually funny :-)

Counting the number of occurrences of a term in an index can be done
efficiently using indexReader.docFreq(term).

See FuzzyTermEnum for how to list all similar terms. Depending on the size
of your index, you will probably have to create your own version. Most
effective optimization: enumerate only terms that start with the same two
or three characters as the original term, by calling
super.setEnum(indexReader.terms(prefixTerm)) in the constructor of your
TermEnum, where prefixTerm is a Term holding just those leading characters,
and ending the enumeration once terms no longer share that prefix.
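To show why the prefix trick is cheap, here is a small stand-alone sketch. It does not use Lucene at all: a sorted TreeMap plays the role of the index's term dictionary, and the names PrefixTermScan and candidates are invented for the example. The idea is the same as seeking the term enumeration to the prefix and stopping at the first term that no longer matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class PrefixTermScan {
    /**
     * Lists the terms that share the first prefixLen characters with the given term.
     * The sorted map is a stand-in for a sorted term dictionary (term -> document frequency).
     */
    static List<String> candidates(String term, TreeMap<String, Integer> terms, int prefixLen) {
        String prefix = term.substring(0, Math.min(prefixLen, term.length()));
        List<String> out = new ArrayList<>();
        // tailMap starts the scan at the first term >= prefix, so everything
        // sorted before the prefix is never visited.
        for (String t : terms.tailMap(prefix, true).keySet()) {
            if (!t.startsWith(prefix)) break; // sorted order: no later term can match either
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> terms = new TreeMap<>();
        terms.put("index", 300);
        terms.put("lucene", 120);
        terms.put("lucent", 7);
        terms.put("luck", 3);
        terms.put("lunar", 12);
        System.out.println(candidates("lucine", terms, 3)); // prints [lucene, lucent, luck]
    }
}
```

The price is that a typo within the first two or three characters will never be corrected, since the right term is outside the scanned range.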

Runs within milliseconds on a half-gigabyte index.

--
Eric Jain



