lucene-dev mailing list archives

From "Cormac Twomey" <cor...@siderean.com>
Subject Fuzzy queries not given boost of 0.2, as stated on website
Date Fri, 14 Mar 2003 20:55:13 GMT

According to the website's "Query Syntax" page, fuzzy searches are given a
boost of 0.2. I've found this not to be the case. Rather, it appears to me
(please confirm) that every variation on the term found in the index is
collected if its similarity, dist = 1 - levenshteinDistance /
min(termLength, variantLength), is greater than 0.5. This then leads to a
boolean OR search over all the variant terms, with each variant's boost
set to (dist - 0.5) * 2.

Is that more or less correct?
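
For anyone who wants to check the arithmetic, here is a minimal sketch of
the computation as I understand it. It is not Lucene's source: the class
and method names are made up for illustration, and it assumes the
similarity is 1 minus the edit distance divided by the shorter length,
which is what makes an exact match come out at a boost of 1.0.

```java
// Illustrative sketch only; names are hypothetical, not Lucene API.
final class FuzzyBoostSketch {
    static final double THRESHOLD = 0.5; // cutoff quoted in this message

    // Classic dynamic-programming Levenshtein distance (two rolling rows).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumption: similarity = 1 - distance / min(termLength, variantLength).
    static double similarity(String term, String variant) {
        int minLen = Math.min(term.length(), variant.length());
        if (minLen == 0) return 0.0; // avoid dividing by zero on empty input
        return 1.0 - (double) levenshtein(term, variant) / minLen;
    }

    // A variant is kept only if its similarity exceeds the threshold.
    static boolean matches(String term, String variant) {
        return similarity(term, variant) > THRESHOLD;
    }

    // Boost rescales the similarity range (0.5, 1.0] onto (0.0, 1.0].
    static double boost(String term, String variant) {
        return (similarity(term, variant) - THRESHOLD) * 2.0;
    }

    public static void main(String[] args) {
        String term = "adagio";
        for (String variant : new String[] {"adagio", "adagia", "quincy"}) {
            System.out.println(variant + ": matches=" + matches(term, variant)
                    + " boost=" + boost(term, variant));
        }
    }
}
```

The boost is only meaningful for variants where matches() returns true;
for anything below the threshold the value is zero or negative and the
variant would simply be dropped.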

This means that, for example, given a document set with the following
search field values:
	"adagio b"
	"adagio c"
	"adagio d"
	"adagio e"
	"adagio f"
	"adagio g"
	"adagia m"	// Note the variation from 'adagio'
	"quincy b"
	"quincy c"
	"quincy d"
	"quincy e"
	"quincy f"
	"quincy g"

A search for "adagio~" will actually yield "adagia m" as the number one
result, even though it has a greater Levenshtein distance from the search
term than the six exact matches. This is due to the term "adagia" having a
much lower document frequency, I believe: it occurs in one document, while
"adagio" occurs in six. Thus the promotion "adagia m" gets from the rarity
of its term more than outweighs its lower boost, which by the formula
above is about 0.67 (a dist of about 0.83), versus the 1.0 that the exact
matches receive, in this example.
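
To make the ranking concrete, here is a rough back-of-the-envelope sketch.
It rests on assumptions I have not verified against the 1.2 source: that
the rarity weight is idf = 1 + ln(numDocs / (docFreq + 1)), that it enters
the score squared, that each field value is tokenized into two terms, and
that every other scoring factor is the same for all thirteen documents.
The class and method names are made up for illustration.

```java
// Illustrative sketch only; the weighting formula is an assumption.
final class RankingArithmeticSketch {
    // Assumed rarity weight: rarer terms (low docFreq) get a larger value.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Assumed relative weight of one OR clause: boost times idf squared.
    static double relativeWeight(double boost, int docFreq, int numDocs) {
        double w = idf(docFreq, numDocs);
        return boost * w * w;
    }

    public static void main(String[] args) {
        int numDocs = 13; // the thirteen field values listed in this message

        // "adagio" is an exact match (boost 1.0) and occurs in 6 documents.
        double exact = relativeWeight(1.0, 6, numDocs);

        // "adagia" is one edit away over length 6, so by the formula in this
        // message its boost is (5/6 - 0.5) * 2; it occurs in 1 document.
        double variantBoost = (5.0 / 6.0 - 0.5) * 2.0;
        double variant = relativeWeight(variantBoost, 1, numDocs);

        System.out.println("exact   'adagio': " + exact);
        System.out.println("variant 'adagia': " + variant);
        System.out.println("variant outranks exact: " + (variant > exact));
    }
}
```

Under those assumptions the rarity of the variant term dominates its
reduced boost, which is consistent with the ordering I am seeing.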

Proposed solution:
If the boost calculated above for *non-exact match* fuzzy terms were
multiplied by 0.2, while exact matches kept their full boost, this problem
would be mitigated. Thoughts?
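
In code, the proposal amounts to something like the following sketch. The
class, constant, and method names are hypothetical and do not correspond
to anything in Lucene; rawBoost stands for whatever boost the existing
fuzzy logic has already computed for the variant.

```java
// Illustrative sketch of the proposed damping; names are hypothetical.
final class DampedFuzzyBoostSketch {
    // The 0.2 factor mentioned on the "Query Syntax" page.
    static final double NON_EXACT_FACTOR = 0.2;

    // Exact matches keep their boost; every other variant is scaled down.
    // Note that equals() is case sensitive, like the fuzzy match itself.
    static double dampedBoost(String term, String variant, double rawBoost) {
        return term.equals(variant) ? rawBoost : rawBoost * NON_EXACT_FACTOR;
    }

    public static void main(String[] args) {
        System.out.println(dampedBoost("adagio", "adagio", 1.0));
        System.out.println(dampedBoost("adagio", "adagia", 0.5));
    }
}
```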

While puzzling through this, I threw together a little test app, which
creates an index with the above strings in it, and passes in your command
line arguments as search terms. You can find it at:

http://patrick.bpallen.com/~cormac/levtest.java
Usage:   java -classpath lucene-1.2.jar:. levtest search-terms
(Replace 'search-terms' with your search query).

Incidentally, this tool is also useful for confirming the bug (#18014) I
just posted, that fuzzy searches are case sensitive. Use the tool to
search for 'ADAgio~' and no results come back.

Regards,
--Cormac Twomey


