lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Supun Edirisinghe <su...@office.vtourist.com>
Subject Re: FuzzyQuery info
Date Wed, 03 Mar 2004 18:37:58 GMT
thanks again for the info, Erik.

I am looking at a great big index; I expect that we will always have .
So, the FuzzyQuery is less viable now. 

as for what I'm trying to do: We have a site search that uses Lucene in
a basic way. It needs to be improved and I'm trying to research all the
features of Lucene that we are not using. The reason for looking at
FuzzyQuery was because terms in our database get mispelled and get typed
in wierd ways by our users. The searches are done through a web
interface on a very busy site, so performance is also a big issue. Alot
of user generated content is indexed. 

thanks for your help.


Also, Jim Hargrave  had a good explaination too:

As I understand it the Levenshtein algorithm is applied to each term in
the index - then a Boolean query (OR) is formed from all the htis above
a certain threshold. The Leven. algorithm is quadratic - so can be slow
for larger strings. Smaller strings aren' too bad. However, if you have
thousands of terms in your index this can be VERY slow. Especially if
the query is redone for every 50 Lucene hits (see previous post on
wildcard performace).
 
IMHO, this is a part of Lucene that could be better. Using an "agrep"
type algorithm would speed things up. 
 
Jim


On Tue, 2004-03-02 at 10:43, Erik Hatcher wrote:
> On Mar 2, 2004, at 1:23 PM, Supun Edirisinghe wrote:
> > now, one more question: what are the big performance hits from using a
> > FuzzyQuery. what are some bad cases to use it(eg. many words in the
> > phrase? long strings? ) would it be better to read up on the 
> > Levenshtein
> > algorithm or to get into the internals of Lucene and compare what is
> > done in FuzzyQuery as opposed to something simpler like PhraseQuery
> 
> To do a FuzzyQuery, *every* term in the field needs to be enumerated 
> and checked.  This may or may not be a problem depending on the size of 
> your index, but it is definitely worth noting.
> 
> What are you trying to accomplish?
> 
> 	Erik
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message