lucene-java-user mailing list archives

From markharw00d <>
Subject Re: Why exactly are fuzzy queries so slow?
Date Sun, 25 Nov 2007 10:54:15 GMT
For "fuzzy" you're going to pay one way or another.
You can use ngram analyzers on both indexed content and queries, which
adds IO costs ("files" becomes "fi", "fil", "file", "il", "ile", "iles"
and so on in both your query and index), or you can use some form of
query-time edit distance comparison on "files" and pay the CPU costs.
You can use WordNet to expand synonyms, so "files" becomes "registers".
You can examine large volumes of user queries and work out the most
likely interpretation. You can use Soundex, and then if you're lucky
files==philes, but there's no room for error: terms either match or
they don't, and there is no measure of similarity.
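
To make the two main costs concrete, here is a rough sketch in plain Java (no Lucene classes involved). The class name FuzzyCostSketch, the method names and the 2-to-4 gram size range are my own assumptions for illustration, not what any particular Lucene analyzer or FuzzyQuery does internally. The first method shows how one term fans out into many index/query terms (the IO cost), the second is a textbook Levenshtein computation that would have to run once per candidate term (the CPU cost).

```java
import java.util.ArrayList;
import java.util.List;

public class FuzzyCostSketch {

    // All character n-grams of term with lengths minLen..maxLen,
    // ordered by start position. Gram sizes are an assumption.
    static List<String> ngrams(String term, int minLen, int maxLen) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < term.length(); start++) {
            for (int len = minLen; len <= maxLen && start + len <= term.length(); len++) {
                grams.add(term.substring(start, start + len));
            }
        }
        return grams;
    }

    // Textbook Levenshtein distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // One five-letter term turns into nine grams to index and to query.
        System.out.println(ngrams("files", 2, 4));
        // One comparison; a fuzzy query repeats this for every candidate term.
        System.out.println(editDistance("files", "philes"));
    }
}
```

With these assumed gram sizes "files" yields nine grams, and the distance between "files" and "philes" is 2 (substitute f with p, insert h), so unlike Soundex you do get a graded measure of similarity, at a price.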

There's no free lunch here.

Timo Nentwig wrote:
> On Saturday 24 November 2007 18:28:48 markharw00d wrote:
>> term. You can limit the number of edit distance comparisons conducted by
>> setting the minimum prefix length. This is a property of the QueryParser
> Well, javadoc: "prefixLength - length of common (non-fuzzy) prefix". So, this 
> is some kind of "wildcard fuzzy" but not real fuzzy anymore. 
> I understand the optimization but right now I can hardly imagine a reasonable 
> use case. Who cares whether the Levenshtein distance is at the beginning, middle 
> or end of a word? E.g. when searching fuzzy for "philes" I want to 
> find "files"...
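
The trade-off being discussed can be sketched in a few lines of plain Java. This is only an illustration of the idea of a required common prefix; the class PrefixFilterSketch, its method and the sample term list are hypothetical and this is not Lucene's actual implementation. Terms that do not share the first prefixLength characters with the query are dropped before any edit distance is computed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PrefixFilterSketch {

    // Keep only the terms that share the first prefixLength characters
    // with the query; only these would go on to an edit distance check.
    static List<String> candidates(List<String> terms, String query, int prefixLength) {
        String prefix = query.substring(0, Math.min(prefixLength, query.length()));
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            if (term.startsWith(prefix)) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical index terms, for illustration only.
        List<String> terms = Arrays.asList("files", "filter", "philes", "piles");
        // prefixLength 0: every term is a candidate (most CPU work).
        System.out.println(candidates(terms, "files", 0));
        // prefixLength 2: only terms starting with "fi" are compared.
        System.out.println(candidates(terms, "files", 2));
    }
}
```

With a prefix length of 2 the candidate list shrinks to "files" and "filter", which is where the saving comes from, and "philes" is never compared at all, which is exactly the objection raised above.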
