Message-ID: <27542586.1160074282495.JavaMail.root@brutus>
Date: Thu, 5 Oct 2006 11:51:22 -0700 (PDT)
From: "Hoss Man (JIRA)"
To: java-dev@lucene.apache.org
Reply-To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-329) Fuzzy query scoring issues

    [ http://issues.apache.org/jira/browse/LUCENE-329?page=comments#action_12440213 ]

Hoss Man commented on LUCENE-329:
---------------------------------

I'm not very familiar with this issue, but a quick review of the patch and the existing comments leads me to believe that committing it as is
wouldn't be the cleanest way to deal with this problem:

1) As Mark mentioned, a "disableCoord" option has been added to BooleanQuery, which makes the ExpandedTermsQuery class in this patch unnecessary. Furthermore, the ExpandedTermsQuery class is (in my opinion) broken because it ignores any user-specified Similarity completely (instead of just ignoring the coord factor).

2) This patch does not include any test cases.

Furthermore, this patch (assuming it does what it says it does) would fundamentally alter the scoring of a FuzzyQuery -- the new scoring may seem like the right thing to do for you and for Mark, but without a larger consensus I'm a little worried that there may be just as many FuzzyQuery users out there who are happy with the current behavior and would consider this change a bug. So any "fix" for this should either be configurable, or have an overwhelming amount of support that the change is the "right" thing to do.

(The main issue seems to be that terms with a low document frequency, and therefore a high idf, have a larger impact, and that results in rare misspellings getting higher scores -- but assuming "clean data" with no misspellings, scoring "rare" terms higher seems like the ideal behavior, correct?)

> Fuzzy query scoring issues
> --------------------------
>
>                 Key: LUCENE-329
>                 URL: http://issues.apache.org/jira/browse/LUCENE-329
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 1.2rc5
>         Environment: Operating System: All
>                      Platform: All
>            Reporter: Mark Harwood
>         Assigned To: Lucene Developers
>         Attachments: patch.txt
>
>
> Queries which automatically produce multiple terms (wildcard, range, prefix,
> fuzzy etc.) currently suffer from two problems:
> 1) Scores for matching documents are significantly smaller than term queries
> because of the volume of terms introduced (A match on query Foo~ is 0.1
> whereas a match on query Foo is 1).
> 2) The rarer forms of expanded terms are favoured over those of more common
> forms because of the IDF.
> When using Fuzzy queries, for example, rare misspellings typically appear
> in results before the more common correct spellings.
> I will attach a patch that corrects the issues identified above by
> 1) Overriding Similarity.coord to counteract the downplaying of scores
> introduced by expanding terms.
> 2) Taking the IDF factor of the most common form of expanded terms as the
> basis of scoring all other expanded terms.
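
For illustration only, here is a minimal standalone sketch of the two effects the issue describes. It does not use Lucene and is not the attached patch. The idf and coord formulas are assumptions modeled on Lucene's DefaultSimilarity; the class name, method names, and all document counts are hypothetical.

```java
// Standalone sketch, no Lucene dependency. The formulas are ASSUMED to mirror
// DefaultSimilarity; check them against the Lucene version you actually run.
final class FuzzyScoringSketch {

    // Assumed idf: 1 + ln(numDocs / (docFreq + 1)). Rarer terms score higher.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Assumed coord: fraction of the query's clauses that matched the document.
    static double coord(int overlap, int maxOverlap) {
        return overlap / (double) maxOverlap;
    }

    // Proposal 2 from the issue, as a sketch: use the idf of the most common
    // expanded term (highest docFreq) as the idf for every expanded term.
    static double sharedIdf(int[] docFreqs, int numDocs) {
        int maxDocFreq = 0;
        for (int docFreq : docFreqs) {
            maxDocFreq = Math.max(maxDocFreq, docFreq);
        }
        return idf(maxDocFreq, numDocs);
    }

    public static void main(String[] args) {
        int numDocs = 10000;      // hypothetical index size
        int correctDocFreq = 500; // hypothetical: the common, correct spelling
        int typoDocFreq = 2;      // hypothetical: a rare misspelling

        System.out.println("idf(correct) = " + idf(correctDocFreq, numDocs));
        System.out.println("idf(typo)    = " + idf(typoDocFreq, numDocs));

        // A document matching 1 of 10 expanded terms is scaled down by coord.
        System.out.println("coord(1, 10) = " + coord(1, 10));

        // With a shared idf, both spellings get the same idf factor.
        int[] docFreqs = {correctDocFreq, typoDocFreq};
        System.out.println("shared idf   = " + sharedIdf(docFreqs, numDocs));
    }
}
```

Under these assumptions the misspelling gets the larger idf, and a document matching one of ten expanded terms is scaled by one tenth. Whether flattening the idf is desirable is exactly the open question in the comment above: on clean data, the higher weight for rare terms may be what users want.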