lucene-dev mailing list archives

From "Mark Harwood (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-329) Fuzzy query scoring issues
Date Thu, 13 Nov 2008 09:55:44 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647243#action_12647243 ]

Mark Harwood commented on LUCENE-329:
-------------------------------------

This patch goes back a while.
Contrib's FuzzyLikeThisQuery contains my current "best practice" for fuzzy matching, but the
logic is mixed in with code that also does "LikeThis" optimisations, i.e. working out which input
terms are the best to search on rather than using all input terms. The fuzzy matching logic could
usefully be lifted out and used elsewhere.

The fuzzy scoring logic takes the IDF of the input term and uses that as the IDF for scoring
all expanded variants. If the input term does not exist in the index, then all variants are
rewarded with their averaged IDF. Coord is disabled.
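
As a rough illustration of the rule described above, here is a minimal sketch in plain Java with no Lucene dependency. The class and method names are hypothetical and are not taken from FuzzyLikeThisQuery, and the IDF formula (1 + ln(numDocs / (docFreq + 1))) is assumed for illustration only.

```java
import java.util.List;

// Illustrative sketch only: these names are hypothetical, not Lucene APIs.
final class SharedIdfSketch {

    // Assumed classic TF-IDF style formula: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Returns the single IDF value used to score every expanded variant.
    // inputDocFreq is the document frequency of the term the user typed
    // (0 if the term is absent); variantDocFreqs holds the document
    // frequencies of the expanded variants.
    static double sharedIdf(int inputDocFreq, List<Integer> variantDocFreqs, int numDocs) {
        if (inputDocFreq > 0) {
            // The input term exists: its IDF is used for all variants.
            return idf(inputDocFreq, numDocs);
        }
        if (variantDocFreqs.isEmpty()) {
            return 0.0; // nothing was expanded, so there is nothing to score
        }
        // The input term is absent: fall back to the average IDF of the variants.
        double sum = 0.0;
        for (int docFreq : variantDocFreqs) {
            sum += idf(docFreq, numDocs);
        }
        return sum / variantDocFreqs.size();
    }
}
```

Under these assumptions, a rare misspelling and a common correct spelling both receive the same IDF, so IDF alone no longer pushes the rare form to the top of the results.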

Using some form of IDF is typically desirable to balance a fuzzy query against other (potentially
non-fuzzy) clauses in the overall user query. Within a fuzzy query (or a wildcard or other
auto-expanding query), however, I see no reason to differentiate between the auto-expanded terms
on the basis of their IDF values. In my view these auto-expanding queries should generally use the
same IDF for all variants and only reward them differently based on edit distance or whatever other
distance metric is meaningful to that form of expansion (e.g. an age range query on age 40 +/- 10
years could reward based on closeness to the input value 40).

Cheers
Mark

> Fuzzy query scoring issues
> --------------------------
>
>                 Key: LUCENE-329
>                 URL: https://issues.apache.org/jira/browse/LUCENE-329
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 1.2rc5
>         Environment: Operating System: All
> Platform: All
>            Reporter: Mark Harwood
>            Assignee: Lucene Developers
>            Priority: Minor
>         Attachments: patch.txt
>
>
> Queries which automatically produce multiple terms (wildcard, range, prefix, 
> fuzzy, etc.) currently suffer from two problems:
> 1) Scores for matching documents are significantly smaller than those of term 
> queries because of the volume of terms introduced (a match on query Foo~ is 0.1 
> whereas a match on query Foo is 1).
> 2) The rarer forms of expanded terms are favoured over the more common 
> forms because of the IDF. When using fuzzy queries, for example, rare 
> misspellings typically appear in results before the more common correct spellings.
> I will attach a patch that corrects the issues identified above by:
> 1) Overriding Similarity.coord to counteract the downplaying of scores 
> introduced by expanding terms.
> 2) Taking the IDF factor of the most common form of expanded terms as the 
> basis of scoring all other expanded terms.
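
The two fixes described in the issue can be sketched roughly as follows, in plain Java with no Lucene dependency. The names are hypothetical, the coord formula (matching clauses divided by total clauses) and the IDF formula are assumed for illustration, and this is not the code from patch.txt.

```java
// Illustrative sketch only: not the attached patch, and not Lucene's API.
final class ExpansionPatchSketch {

    // Assumed default coordination factor: fraction of expanded clauses that matched.
    static double defaultCoord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    // Fix 1: a coord override that no longer penalises documents for
    // matching only a few of the many expanded terms.
    static double flatCoord(int overlap, int maxOverlap) {
        return 1.0;
    }

    // Assumed classic TF-IDF style formula: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Fix 2: score every variant with the IDF of the most common variant,
    // i.e. the one with the highest document frequency.
    static double idfOfMostCommon(int[] variantDocFreqs, int numDocs) {
        int maxDocFreq = 0;
        for (int docFreq : variantDocFreqs) {
            maxDocFreq = Math.max(maxDocFreq, docFreq);
        }
        return idf(maxDocFreq, numDocs);
    }
}
```

For example, if Foo~ expands to ten terms and a document matches one of them, defaultCoord(1, 10) is 0.1 while flatCoord(1, 10) is 1.0, which is in line with the gap between the Foo~ and Foo scores quoted in the issue.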

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

