lucene-dev mailing list archives

From "Mark Harwood (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (LUCENE-329) Fuzzy query scoring issues
Date Mon, 15 Feb 2010 17:07:28 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833876#action_12833876 ]

Mark Harwood edited comment on LUCENE-329 at 2/15/10 5:05 PM:
--------------------------------------------------------------

bq. consider simpler case

OK - but we need to remember that it is important to achieve balance _across_ different fuzzy
queries as well as terms _within_ the same fuzzy query.
Let's stick to the terms within a single fuzzy query for now:

bq. I guess you would like to score the second term higher, meaning Lower frequency

No, a variant's frequency is not a deciding factor - only edit distance is. "Johana" has similarity
0.6 while "Joahn" has 0.2, so I would favour result one (although this difference seems
a little off in this case).
The basic assumption is that the user's input is valid and not a typo (deriving spelling suggestions
etc. is a different topic and one we shouldn't try to cover here).
Fuzzy matching can drag in all sorts of unqualified variants with massively different frequencies.
Because we cannot control these discrepancies we should reward all these alternatives using
the known factors we have to hand - the IDF of the user's supposedly valid input and the similarity
measure of each variant compared to the input.
We could get fancy about the probability of variants given the other input terms in the query,
but that feels like it's straying into spell checker territory, ngrams etc.
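
A rough sketch of the weighting described above, written in plain Python rather than against
the Lucene API. The function names, the document counts and the input term's document frequency
are all hypothetical, and the IDF formula is assumed to be the classic 1 + ln(numDocs / (docFreq + 1))
form. Only the two similarity values (0.6 and 0.2) come from the comment.

```python
import math


def idf(doc_freq, num_docs):
    # Assumed classic IDF form: 1 + ln(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def variant_weight(input_doc_freq, num_docs, similarity):
    # Every variant is weighted by the IDF of the user's input term
    # (not its own IDF), scaled by its similarity to that input.
    return idf(input_doc_freq, num_docs) * similarity


# Hypothetical index statistics for the user's input term.
NUM_DOCS = 1000
INPUT_DOC_FREQ = 50

# Similarity values quoted in the comment.
johana = variant_weight(INPUT_DOC_FREQ, NUM_DOCS, 0.6)
joahn = variant_weight(INPUT_DOC_FREQ, NUM_DOCS, 0.2)

# The ranking depends only on similarity; the variants' own
# document frequencies never enter the calculation.
print(johana > joahn)
```

Because the IDF term is shared, the ratio between two variants' weights equals the ratio of their
similarities, whatever their own frequencies happen to be.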

> Fuzzy query scoring issues
> --------------------------
>
>                 Key: LUCENE-329
>                 URL: https://issues.apache.org/jira/browse/LUCENE-329
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 1.2rc5
>         Environment: Operating System: All
> Platform: All
>            Reporter: Mark Harwood
>            Priority: Minor
>         Attachments: patch.txt
>
>
> Queries which automatically produce multiple terms (wildcard, range, prefix, 
> fuzzy etc.) currently suffer from two problems:
> 1) Scores for matching documents are significantly smaller than term queries 
> because of the volume of terms introduced (A match on query Foo~ is 0.1 
> whereas a match on query Foo is 1).
> 2) The rarer forms of expanded terms are favoured over those of more common 
> forms because of the IDF. When using Fuzzy queries for example, rare mis-
> spellings typically appear in results before the more common correct spellings.
> I will attach a patch that corrects the issues identified above by 
> 1) Overriding Similarity.coord to counteract the downplaying of scores 
> introduced by expanding terms.
> 2) Taking the IDF factor of the most common form of expanded terms as the 
> basis of scoring all other expanded terms.
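
The two corrections listed in the issue description can be sketched as follows. This is a
minimal illustration in plain Python, not the attached patch and not the Lucene API: the function
names, the term frequencies and the index size are hypothetical. It assumes the default coord
factor is overlap / maxOverlap and that IDF has the classic 1 + ln(numDocs / (docFreq + 1)) form.

```python
import math


def idf(doc_freq, num_docs):
    # Assumed classic IDF form: 1 + ln(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def default_coord(overlap, max_overlap):
    # Assumed default behaviour: fraction of query terms that matched.
    return overlap / max_overlap


def flat_coord(overlap, max_overlap):
    # Correction 1: ignore how many expanded terms matched.
    return 1.0


def shared_idf(doc_freqs, num_docs):
    # Correction 2: use the IDF of the most common expanded term
    # as the basis for scoring every expanded term.
    return idf(max(doc_freqs.values()), num_docs)


# Hypothetical expansion of Foo~ into ten terms, one of which matches.
print(default_coord(1, 10), flat_coord(1, 10))

# Hypothetical document frequencies for three expanded terms.
doc_freqs = {"foo": 500, "fob": 40, "fooo": 2}
print(shared_idf(doc_freqs, 1000) < idf(doc_freqs["fooo"], 1000))
```

With per-term IDF the rare form "fooo" would get the largest weight; with the shared IDF all three
forms start from the same basis, so a rare misspelling no longer outranks the common spelling.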
