lucene-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 21446] New: - Fuzzy Searches do not get a boost of 0.2 as stated in "Query Syntax" doc
Date Wed, 09 Jul 2003 20:46:45 GMT
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=21446>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=21446

Fuzzy Searches do not get a boost of 0.2 as stated in "Query Syntax" doc

           Summary: Fuzzy Searches do not get a boost of 0.2 as stated in
                    "Query Syntax" doc
           Product: Lucene
           Version: 1.2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: Normal
          Priority: Other
         Component: Search
        AssignedTo: lucene-dev@jakarta.apache.org
        ReportedBy: cormac@siderean.com


According to the website's "Query Syntax" page, fuzzy searches are given a
boost of 0.2. I've found this not to be the case, and have seen situations where
exact matches have lower relevance scores than fuzzy matches.

Rather than getting a boost of 0.2, it appears that all variations on the term
are first found in the model, where dist* > 0.5.

* dist = 1 - (levenshteinDistance / min(termlength, variantlength))

This then leads to a boolean OR search of all the variant terms, each of whose
boost is set to (dist - 0.5)*2 for that variant.
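
As a rough illustration of the scoring described above, the following Java
sketch computes dist and the resulting boost for a query term and a variant.
It assumes dist is the similarity 1 - levenshteinDistance / min(termlength,
variantlength), so that an exact match yields 1.0 (which is what the
distance == 1.0 check in the patch further down relies on). The class and
method names are hypothetical and not part of Lucene; the two constants
mirror FUZZY_THRESHOLD and SCALE_FACTOR as they appear in the unpatched
FuzzyTermEnum.

```java
// Hypothetical stand-alone sketch; not Lucene code.
class FuzzyBoostSketch {
    static final double FUZZY_THRESHOLD = 0.5;
    // Unpatched scale factor: 1 / (1 - 0.5) = 2
    static final double SCALE_FACTOR = 1.0 / (1.0 - FUZZY_THRESHOLD);

    // Plain dynamic-programming Levenshtein edit distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Assumed definition: dist = 1 - editDistance / min(len(term), len(variant)).
    static double similarity(String term, String variant) {
        int minLen = Math.min(term.length(), variant.length());
        return 1.0 - (double) levenshtein(term, variant) / minLen;
    }

    // Boost given to a variant in the OR query: (dist - 0.5) * 2.
    static float boost(double dist) {
        return (float) ((dist - FUZZY_THRESHOLD) * SCALE_FACTOR);
    }

    public static void main(String[] args) {
        String term = "lucene";
        String[] variants = {"lucene", "lucine", "lucent", "lacine"};
        for (String v : variants) {
            double dist = similarity(term, v);
            if (dist > FUZZY_THRESHOLD) {
                System.out.println(v + " dist=" + dist + " boost=" + boost(dist));
            } else {
                System.out.println(v + " dist=" + dist + " (below threshold, skipped)");
            }
        }
    }
}
```

Under these assumptions a variant one edit away from a six-letter term gets a
boost of about 0.67, not 0.2.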

The upshot of all of this is that there are many cases where a fuzzy match will
get a higher relevance score than an exact match.

See this email for a test case to reproduce this anomalous behaviour.
http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg02819.html

Here is a candidate patch to address the issue:

*** lucene-1.2\src\java\org\apache\lucene\search\FuzzyTermEnum.java	Sun Jun 09 13:47:54 2002
--- lucene-1.2-modified\src\java\org\apache\lucene\search\FuzzyTermEnum.java	Fri Mar 14 11:37:20 2003
***************
*** 99,105 ****
      }
      
      final protected float difference() {
!         return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR);
      }
      
      final public boolean endEnum() {
--- 99,109 ----
      }
      
      final protected float difference() {
! 		if (distance == 1.0) {
! 			return 1.0f;
! 		}
! 		else
! 			return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR);
      }
      
      final public boolean endEnum() {
***************
*** 111,117 ****
       ******************************/
      
      public static final double FUZZY_THRESHOLD = 0.5;
!     public static final double SCALE_FACTOR = 1.0f / (1.0f - FUZZY_THRESHOLD);
      
      /**
       Finds and returns the smallest of three integers 
--- 115,121 ----
       ******************************/
      
      public static final double FUZZY_THRESHOLD = 0.5;
!     public static final double SCALE_FACTOR = 0.2f * (1.0f / (1.0f - FUZZY_THRESHOLD));
      
      /**
       Finds and returns the smallest of three integers
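
To show the intended effect of the patch, here is a minimal stand-alone Java
sketch of the patched difference() logic. The class name is hypothetical, and
the method takes the distance as a parameter instead of reading the field as
FuzzyTermEnum does; the constants and the arithmetic are copied from the
patch above.

```java
// Hypothetical stand-alone sketch of the patched boost; not Lucene code.
class PatchedBoostSketch {
    static final double FUZZY_THRESHOLD = 0.5;
    // Patched scale factor: 0.2 * (1 / (1 - 0.5)), roughly 0.4
    static final double SCALE_FACTOR = 0.2f * (1.0f / (1.0f - FUZZY_THRESHOLD));

    static float difference(double distance) {
        // An exact match keeps the full boost of 1.0.
        if (distance == 1.0) {
            return 1.0f;
        }
        // Every other variant is scaled into the range (0, 0.2).
        return (float) ((distance - FUZZY_THRESHOLD) * SCALE_FACTOR);
    }

    public static void main(String[] args) {
        double[] distances = {1.0, 5.0 / 6.0, 0.75, 0.6};
        for (double d : distances) {
            System.out.println("distance=" + d + " boost=" + difference(d));
        }
    }
}
```

With these values an exact match keeps a boost of 1.0, while any inexact
variant stays below 0.2, so boost alone no longer favours a fuzzy match over
an exact one.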

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

