lucene-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 21446] - Fuzzy Searches do not get a boost of 0.2 as stated in "Query Syntax" doc
Date Wed, 10 Sep 2003 18:01:50 GMT
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=21446>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=21446

Fuzzy Searches do not get a boost of 0.2 as stated in "Query Syntax" doc

------- Additional Comments From cormac@siderean.com  2003-09-10 18:01 -------
I will work on massaging my test case into a JUnit test.

Meanwhile, I chose the value of 0.2 simply because it is the documented
behavior, and therefore I considered that to be the expected, even desired,
behavior. That said, it does appear to be an arbitrarily chosen value, although
not chosen by me :-)

Following the logic of how the scoring mechanism works (or at least my
understanding of it), this is not a universal fix; rather, as I stated in my
original email on lucene-dev, it mitigates the problem. I chose the fix simply
because it brought the functionality in line with the documented behavior.

The essence of the problem is the battle in scoring between Levenshtein distance
and term frequency: high-frequency terms are scored lower than low-frequency
terms. A good example of a low-frequency term is a typo in a document. If the
correctly spelled word has a very high frequency, the misspelled word will come
out on top, due to its significantly lower frequency.

By setting the boost to 0.2, we at least make it 5 times harder (in terms of
frequency) for the misspelled item to appear ahead of the correctly spelled
item. But this clearly means that it can still happen.
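
To make the trade-off concrete, here is a rough sketch of the arithmetic. It is
a toy model, not Lucene's actual scoring code: the class and method names are
made up for illustration, the idf formula is assumed to follow the classic
log-based form, and the score is reduced to boost times idf, ignoring tf, norms
and every other factor. It also assumes the exact match keeps a boost of 1.0
while the fuzzy match gets 0.2.

```java
public class FuzzyBoostSketch {

    // Assumed idf in the style of the classic log-based formula:
    // the fewer documents contain a term, the larger its weight.
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // Toy score: boost times idf. Real scoring has more factors
    // (term frequency, norms, query normalization) that are left out here.
    static double score(double boost, int docFreq, int numDocs) {
        return boost * idf(docFreq, numDocs);
    }

    public static void main(String[] args) {
        int numDocs = 100_000;
        int typoDocFreq = 1;          // misspelling found in a single document
        int commonDocFreq = 1_000;    // correctly spelled, fairly common
        int veryCommonDocFreq = 50_000; // correctly spelled, extremely common

        double typoUnboosted = score(1.0, typoDocFreq, numDocs);
        double typoBoosted = score(0.2, typoDocFreq, numDocs);

        // Without the boost, the rare typo outranks the common word.
        System.out.println("unboosted typo beats common word: "
                + (typoUnboosted > score(1.0, commonDocFreq, numDocs)));

        // With the 0.2 boost, the common word wins again.
        System.out.println("boosted typo beats common word: "
                + (typoBoosted > score(1.0, commonDocFreq, numDocs)));

        // Against an extremely common word, the boosted typo can still win.
        System.out.println("boosted typo beats very common word: "
                + (typoBoosted > score(1.0, veryCommonDocFreq, numDocs)));
    }
}
```

Under these assumptions the three lines print true, false and true: the boost
fixes the ordering for a moderately common word, but the typo still comes out
ahead once the correct word is frequent enough.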

--Cormac
