lucene-dev mailing list archives

From "Mark Harwood (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-124) Fuzzy Searches do not get a boost of 0.2 as stated in "Query Syntax" doc
Date Thu, 22 Sep 2005 19:50:27 GMT
    [ http://issues.apache.org/jira/browse/LUCENE-124?page=comments#action_12330223 ] 

Mark Harwood commented on LUCENE-124:
-------------------------------------

I would suggest this is a duplicate of http://issues.apache.org/jira/browse/LUCENE-329

The idf rating of all expanded terms should be the same, so that rarer terms are not
favoured. I suggest that this applies to all auto-expanding searches, e.g. range queries.
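
As a rough illustration of the idf point, here is a minimal sketch. It assumes the
classic idf formula ln(numDocs / (docFreq + 1)) + 1 and treats a term's weight as
simply boost * idf, which ignores tf, norms and the rest of the scoring formula.
The class name, method names and document counts are hypothetical.

```java
public class IdfSketch {
    // Assumed idf formula: ln(numDocs / (docFreq + 1)) + 1.
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // Simplification for illustration only: weight = boost * idf.
    static double simplifiedWeight(double boost, double idf) {
        return boost * idf;
    }

    public static void main(String[] args) {
        int numDocs = 1000;                 // made-up index size
        double exactIdf = idf(499, numDocs);  // common, exactly matching term
        double variantIdf = idf(9, numDocs);  // rare fuzzy variant

        // The exact term keeps boost 1.0; the variant gets a made-up boost of 0.6.
        System.out.println("exact   weight: " + simplifiedWeight(1.0, exactIdf));
        System.out.println("variant weight: " + simplifiedWeight(0.6, variantIdf));
    }
}
```

Under these assumptions the rare variant outweighs the exact term despite its lower
boost, which is the effect that giving all expanded terms the same idf would remove.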

Should we drop this bug as a duplicate?

> Fuzzy Searches do not get a boost of 0.2 as stated in "Query Syntax" doc
> ------------------------------------------------------------------------
>
>          Key: LUCENE-124
>          URL: http://issues.apache.org/jira/browse/LUCENE-124
>      Project: Lucene - Java
>         Type: Bug
>   Components: Search
>     Versions: 1.2
>  Environment: Operating System: All
> Platform: All
>     Reporter: Cormac Twomey
>     Assignee: Lucene Developers

>
> According to the website's "Query Syntax" page, fuzzy searches are given a
> boost of 0.2. I've found this not to be the case, and have seen situations where
> exact matches have lower relevance scores than fuzzy matches.
> Rather than getting a boost of 0.2, it appears that all variations on the term
> are first found in the model, where dist* > 0.5.
> * dist = 1 - (levenshteinDistance / min(termlength, variantlength))
> This then leads to a boolean OR search of all the variant terms, each of whose
> boost is set to (dist - 0.5)*2 for that variant.
> The upshot of all of this is that there are many cases where a fuzzy match will
> get a higher relevance score than an exact match.
> See this email for a test case to reproduce this anomalous behaviour.
> http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg02819.html
> Here is a candidate patch to address the issue -
> *** lucene-1.2\src\java\org\apache\lucene\search\FuzzyTermEnum.java	Sun Jun 09 13:47:54 2002
> --- lucene-1.2-modified\src\java\org\apache\lucene\search\FuzzyTermEnum.java	Fri Mar 14 11:37:20 2003
> ***************
> *** 99,105 ****
>       }
>       
>       final protected float difference() {
> !         return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR);
>       }
>       
>       final public boolean endEnum() {
> --- 99,109 ----
>       }
>       
>       final protected float difference() {
> ! 		if (distance == 1.0) {
> ! 			return 1.0f;
> ! 		}
> ! 		else
> ! 			return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR);
>       }
>       
>       final public boolean endEnum() {
> ***************
> *** 111,117 ****
>        ******************************/
>       
>       public static final double FUZZY_THRESHOLD = 0.5;
> !     public static final double SCALE_FACTOR = 1.0f / (1.0f - FUZZY_THRESHOLD);
>       
>       /**
>        Finds and returns the smallest of three integers 
> --- 115,121 ----
>        ******************************/
>       
>       public static final double FUZZY_THRESHOLD = 0.5;
> !     public static final double SCALE_FACTOR = 0.2f * (1.0f / (1.0f - FUZZY_THRESHOLD));
>       
>       /**
>        Finds and returns the smallest of three integers
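
The effect of the quoted patch on variant boosts can be sketched as follows. This is
a standalone illustration, not the Lucene source: the class and method names are
hypothetical, and similarity is assumed to be 1 minus the edit distance divided by
the shorter term length, which is consistent with the patch treating a distance of
1.0 as an exact match.

```java
public class FuzzyBoostSketch {
    static final double FUZZY_THRESHOLD = 0.5;
    // Scale factor before the patch: maps (0.5, 1.0] onto (0.0, 1.0].
    static final double ORIGINAL_SCALE = 1.0 / (1.0 - FUZZY_THRESHOLD);
    // Scale factor after the patch: maps (0.5, 1.0) onto (0.0, 0.2).
    static final double PATCHED_SCALE = 0.2 * ORIGINAL_SCALE;

    // Assumed similarity measure: 1 - editDistance / min(termLength, variantLength).
    static double similarity(int editDistance, int termLength, int variantLength) {
        return 1.0 - (double) editDistance / Math.min(termLength, variantLength);
    }

    // Boost as computed before the patch.
    static float originalBoost(double similarity) {
        return (float) ((similarity - FUZZY_THRESHOLD) * ORIGINAL_SCALE);
    }

    // Boost as computed after the patch: exact matches keep 1.0,
    // all other variants are scaled down to below 0.2.
    static float patchedBoost(double similarity) {
        if (similarity == 1.0) {
            return 1.0f;
        }
        return (float) ((similarity - FUZZY_THRESHOLD) * PATCHED_SCALE);
    }

    public static void main(String[] args) {
        double[] samples = {1.0, similarity(1, 5, 5), 0.75, 0.6};
        for (double s : samples) {
            System.out.printf("similarity=%.2f original=%.3f patched=%.3f%n",
                    s, originalBoost(s), patchedBoost(s));
        }
    }
}
```

With the patched scale factor, any non-exact variant receives a boost below 0.2,
while an exact match keeps a boost of 1.0.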

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



