lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-124) Fuzzy Searches do not get a boost of 0.2 as stated in "Query Syntax" doc
Date Mon, 15 Feb 2010 16:03:28 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833865#action_12833865 ]

Robert Muir commented on LUCENE-124:
------------------------------------

I will wait till after the code freeze and commit this in a few days if no one objects.

I don't claim it's a 'best-practice' fix for fuzzy (see LUCENE-329 for ideas on that). I just
think TOP_TERMS_CONSTANT_BOOLEAN_REWRITE is a useful complement to TOP_TERMS_SCORING_BOOLEAN_REWRITE
for MultiTermQueries that want the Top-N terms expansion but the constant-score behavior
of CONSTANT_BOOLEAN_REWRITE.

This patch doesn't change any defaults for fuzzy either. In fact, it's not specific to fuzzy
at all.
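
For readers unfamiliar with the "Top-N terms expansion" mentioned above, here is a rough, self-contained sketch of the selection step only: keep the N highest-boost candidate terms with a bounded min-heap. This is a toy model, not Lucene's implementation. ScoredTerm, topTerms and the sample terms are hypothetical names made up for illustration, and the sketch says nothing about how the selected terms are scored afterwards.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopTermsSketch {
    // Hypothetical holder for an expanded term and its boost; not a Lucene class.
    static final class ScoredTerm {
        final String text;
        final float boost;

        ScoredTerm(String text, float boost) {
            this.text = text;
            this.boost = boost;
        }
    }

    // Orders terms by ascending boost, so the heap head is the weakest term.
    static final Comparator<ScoredTerm> BY_BOOST =
            Comparator.comparingDouble((ScoredTerm t) -> t.boost);

    // Keep only the n highest-boost terms, returned strongest first.
    static List<ScoredTerm> topTerms(List<ScoredTerm> candidates, int n) {
        PriorityQueue<ScoredTerm> heap = new PriorityQueue<>(BY_BOOST);
        for (ScoredTerm t : candidates) {
            heap.offer(t);
            if (heap.size() > n) {
                heap.poll(); // drop the current weakest term
            }
        }
        List<ScoredTerm> result = new ArrayList<>(heap);
        result.sort(BY_BOOST.reversed());
        return result;
    }

    public static void main(String[] args) {
        // Made-up candidate terms and boosts, for illustration only.
        List<ScoredTerm> candidates = Arrays.asList(
                new ScoredTerm("lucene", 1.0f),
                new ScoredTerm("lucent", 0.66f),
                new ScoredTerm("lucern", 0.33f),
                new ScoredTerm("lumen", 0.2f));
        for (ScoredTerm t : topTerms(candidates, 2)) {
            System.out.println(t.text + " boost=" + t.boost);
        }
    }
}
```

The bounded heap keeps memory proportional to N rather than to the number of candidate terms, which is the usual reason to pick this structure for a top-N cut.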

> Fuzzy Searches do not get a boost of 0.2 as stated in "Query Syntax" doc
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-124
>                 URL: https://issues.apache.org/jira/browse/LUCENE-124
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 1.2
>         Environment: Operating System: All
> Platform: All
>            Reporter: Cormac Twomey
>            Assignee: Robert Muir
>            Priority: Minor
>         Attachments: LUCENE-124.patch
>
>
> According to the website's "Query Syntax" page, fuzzy searches are given a
> boost of 0.2. I've found this not to be the case, and have seen situations where
> exact matches have lower relevance scores than fuzzy matches.
> Rather than getting a boost of 0.2, it appears that all variations on the term
> are first found in the model, where dist* > 0.5.
> * dist = levenshteinDistance / length of min(termlength, variantlength)
> This then leads to a boolean OR search of all the variant terms, each of whose
> boost is set to (dist - 0.5)*2 for that variant.
> The upshot of all of this is that there are many cases where a fuzzy match will
> get a higher relevance score than an exact match.
> See this email for a test case to reproduce this anomalous behaviour.
> http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg02819.html
> Here is a candidate patch to address the issue -
> *** lucene-1.2\src\java\org\apache\lucene\search\FuzzyTermEnum.java	Sun Jun 09 13:47:54 2002
> --- lucene-1.2-modified\src\java\org\apache\lucene\search\FuzzyTermEnum.java	Fri Mar 14 11:37:20 2003
> ***************
> *** 99,105 ****
>       }
>       
>       final protected float difference() {
> !         return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR);
>       }
>       
>       final public boolean endEnum() {
> --- 99,109 ----
>       }
>       
>       final protected float difference() {
> ! 		if (distance == 1.0) {
> ! 			return 1.0f;
> ! 		}
> ! 		else
> ! 			return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR);
>       }
>       
>       final public boolean endEnum() {
> ***************
> *** 111,117 ****
>        ******************************/
>       
>       public static final double FUZZY_THRESHOLD = 0.5;
> !     public static final double SCALE_FACTOR = 1.0f / (1.0f - FUZZY_THRESHOLD);
>       
>       /**
>        Finds and returns the smallest of three integers 
> --- 115,121 ----
>        ******************************/
>       
>       public static final double FUZZY_THRESHOLD = 0.5;
> !     public static final double SCALE_FACTOR = 0.2f * (1.0f / (1.0f - FUZZY_THRESHOLD));
>       
>       /**
>        Finds and returns the smallest of three integers
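
To make the arithmetic in the quoted candidate patch concrete, here is a small standalone sketch that evaluates difference() before and after the change. The threshold and scale values are taken from the diff; the class and method names (FuzzyBoostSketch, originalDifference, patchedDifference) are made up for this example, the sample similarities are arbitrary, and the sketch uses double literals where the patch mixes float and double.

```java
public class FuzzyBoostSketch {
    // Values taken from the quoted FuzzyTermEnum diff.
    static final double FUZZY_THRESHOLD = 0.5;
    static final double ORIGINAL_SCALE = 1.0 / (1.0 - FUZZY_THRESHOLD);        // 2.0
    static final double PATCHED_SCALE = 0.2 * (1.0 / (1.0 - FUZZY_THRESHOLD)); // 0.4

    // Before the patch: similarity in (0.5, 1.0] is stretched onto (0.0, 1.0].
    static float originalDifference(double distance) {
        return (float) ((distance - FUZZY_THRESHOLD) * ORIGINAL_SCALE);
    }

    // After the patch: an exact match keeps 1.0, every other variant stays below 0.2.
    static float patchedDifference(double distance) {
        if (distance == 1.0) {
            return 1.0f;
        }
        return (float) ((distance - FUZZY_THRESHOLD) * PATCHED_SCALE);
    }

    public static void main(String[] args) {
        // Arbitrary sample similarities, for illustration only.
        double[] similarities = {1.0, 0.9, 0.75, 0.6};
        for (double s : similarities) {
            System.out.printf("similarity=%.2f original=%.2f patched=%.2f%n",
                    s, originalDifference(s), patchedDifference(s));
        }
    }
}
```

For a variant with similarity 0.75, the original code yields a boost of 0.5, while the patched code yields 0.1. An exact match gets 1.0 in both versions.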

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

