lucene-java-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: FuzzyQuery info
Date Wed, 03 Mar 2004 21:49:07 GMT
For solving misspellings, you could try encoding and indexing the
content using the Soundex or Double Metaphone algorithm.
There is at least one free product that uses Lucene and those two
algorithms, and it's linked from one of the pages on the Lucene site.
Phonetix?
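
To make the phonetic-encoding idea concrete, here is a minimal sketch of classic American Soundex in plain Java, plus a small map that groups terms by their phonetic key. It is an illustration only: the class name PhoneticKeySketch and the helper buildIndex are made up for this example, and the in-memory map merely stands in for a real index. Wiring the encoder into a Lucene analyzer is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PhoneticKeySketch {
    // Soundex digit for each letter A..Z.
    // '0' marks vowels and Y (not coded, but they separate repeated codes);
    // '-' marks H and W (not coded, and they do NOT separate repeated codes).
    private static final String CODES = "0123012-02245501262301-202";

    // Classic four-character Soundex key, e.g. "Robert" -> "R163".
    static String soundex(String word) {
        StringBuilder key = new StringBuilder();
        char lastCode = '0';
        for (int i = 0; i < word.length(); i++) {
            char c = Character.toUpperCase(word.charAt(i));
            if (c < 'A' || c > 'Z') {
                continue; // ignore anything that is not a plain ASCII letter
            }
            char code = CODES.charAt(c - 'A');
            if (key.length() == 0) {
                key.append(c);      // the first letter is kept as-is
                lastCode = code;
                continue;
            }
            if (code == '-') {
                continue;           // H and W are skipped entirely
            }
            if (code != '0' && code != lastCode) {
                key.append(code);   // new consonant class: emit its digit
            }
            lastCode = code;
            if (key.length() == 4) {
                break;
            }
        }
        if (key.length() == 0) {
            return "";
        }
        while (key.length() < 4) {
            key.append('0');        // pad short keys with zeros
        }
        return key.toString();
    }

    // Hypothetical helper: group terms by phonetic key (stand-in for an index).
    static Map<String, List<String>> buildIndex(List<String> terms) {
        Map<String, List<String>> index = new HashMap<>();
        for (String term : terms) {
            index.computeIfAbsent(soundex(term), k -> new ArrayList<>()).add(term);
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<String>> index =
                buildIndex(Arrays.asList("restaurant", "museum", "hotel"));
        // The misspelled query is encoded the same way and looked up by key.
        System.out.println(index.get(soundex("restaraunt")));
    }
}
```

The point of the design is that the lookup is a single exact match on the encoded key, so no terms have to be scanned at query time. The trade-off is that Soundex is coarse and English-oriented, so unrelated words can share a key.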

Otis



--- Supun Edirisinghe <supun@office.vtourist.com> wrote:
> thanks again for the info, Erik.
> 
> I am looking at a great big index; I expect that we will always have
> .
> So, the FuzzyQuery is less viable now. 
> 
> As for what I'm trying to do: We have a site search that uses Lucene
> in a basic way. It needs to be improved and I'm trying to research all
> the features of Lucene that we are not using. The reason for looking
> at FuzzyQuery was that terms in our database get misspelled and get
> typed in weird ways by our users. The searches are done through a web
> interface on a very busy site, so performance is also a big issue.
> A lot of user-generated content is indexed.
> 
> thanks for your help.
> 
> 
> Also, Jim Hargrave had a good explanation:
> 
> As I understand it the Levenshtein algorithm is applied to each term
> in the index - then a Boolean query (OR) is formed from all the hits
> above a certain threshold. The Levenshtein algorithm is quadratic - so
> it can be slow for larger strings. Smaller strings aren't too bad.
> However, if you have thousands of terms in your index this can be VERY
> slow. Especially if the query is redone for every 50 Lucene hits (see
> previous post on wildcard performance).
>  
> IMHO, this is a part of Lucene that could be better. Using an "agrep"
> type algorithm would speed things up. 
>  
> Jim
> 
> 
> On Tue, 2004-03-02 at 10:43, Erik Hatcher wrote:
> > On Mar 2, 2004, at 1:23 PM, Supun Edirisinghe wrote:
> > > Now, one more question: what are the big performance hits from
> > > using a FuzzyQuery? What are some bad cases to use it (e.g. many
> > > words in the phrase? long strings?) Would it be better to read up
> > > on the Levenshtein algorithm or to get into the internals of Lucene
> > > and compare what is done in FuzzyQuery as opposed to something
> > > simpler like PhraseQuery?
> > 
> > To do a FuzzyQuery, *every* term in the field needs to be
> > enumerated and checked.  This may or may not be a problem depending
> > on the size of your index, but it is definitely worth noting.
> > 
> > What are you trying to accomplish?
> > 
> > 	Erik
> > 
> > 
> >
> ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail:
> lucene-user-help@jakarta.apache.org
> 
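
The cost described in the quoted messages (every term enumerated, Levenshtein distance computed for each, matches combined with OR) can be sketched in plain Java as follows. This is a rough illustration under stated assumptions, not Lucene's actual FuzzyQuery code: the class name FuzzyScanSketch, the method fuzzyMatches, and the maxEdits cutoff are all invented for this example, and a plain list of strings stands in for the term enumeration of a field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FuzzyScanSketch {
    // Standard dynamic-programming Levenshtein distance.
    // Time is O(a.length() * b.length()), which is the "quadratic" cost
    // mentioned in the thread; only two rows of the table are kept.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Hypothetical stand-in for the term scan: every term is compared with
    // the query, so the work grows linearly with the number of distinct terms.
    // The returned terms are what would be OR-ed together in a Boolean query.
    static List<String> fuzzyMatches(String query, List<String> terms, int maxEdits) {
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (levenshtein(query, term) <= maxEdits) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("lucene", "lucent", "luxury", "search");
        System.out.println(fuzzyMatches("lucene", terms, 1));
    }
}
```

With T distinct terms of average length m and a query of length n, the scan costs roughly T * m * n character comparisons per query, which is why a very large index on a busy site makes this approach unattractive.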

