Message-ID: <2993277.1139560917573.JavaMail.jira@ajax.apache.org>
Date: Fri, 10 Feb 2006 09:41:57 +0100 (CET)
From: "Hoss Man (JIRA)"
To: java-dev@lucene.apache.org
Reply-To: java-dev@lucene.apache.org
Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Subject: [jira] Commented: (LUCENE-323) [PATCH] MultiFieldQueryParser and BooleanQuery do not provide adequate support for queries across multiple fields

    [ http://issues.apache.org/jira/browse/LUCENE-323?page=comments#action_12365858 ]

Hoss Man commented on LUCENE-323:
---------------------------------

The WikipediaSimilarity seems to have been included only as an example for the purposes of comparison testing, not as an item to be committed.
Given Chuck's comment on 21/Dec/05, I'm of the opinion this issue should be closed.

> [PATCH] MultiFieldQueryParser and BooleanQuery do not provide adequate support for queries across multiple fields
> -----------------------------------------------------------------------------------------------------------------
>
>          Key: LUCENE-323
>          URL: http://issues.apache.org/jira/browse/LUCENE-323
>      Project: Lucene - Java
>         Type: Bug
>   Components: QueryParser
>     Versions: 1.4
>  Environment: Operating System: Windows XP
>               Platform: PC
>     Reporter: Chuck Williams
>     Assignee: Lucene Developers
>  Attachments: DisjunctionMaxQuery.java, DisjunctionMaxScorer.java, TestDisjunctionMaxQuery.java, TestMaxDisjunctionQuery.java, TestRanking.zip, TestRanking.zip, TestRanking.zip, WikipediaSimilarity.java, WikipediaSimilarity.java, WikipediaSimilarity.java, dms.tar.gz
>
> The attached test case demonstrates this problem and provides a fix:
> 1. Use a custom similarity to eliminate all tf and idf effects, just to
> isolate what is being tested.
> 2. Create two documents doc1 and doc2, each with two fields title and
> description. doc1 has "elephant" in title and "elephant" in description.
> doc2 has "elephant" in title and "albino" in description.
> 3. Express query for "albino elephant" against both fields.
> Problems:
> a. MultiFieldQueryParser won't recognize either document as containing
> both terms, due to the way it expands the query across fields.
> b. Expressing query as "title:albino description:albino title:elephant
> description:elephant" will score both documents equivalently, since each
> matches two query terms.
> 4. Comparison to MaxDisjunctionQuery and my method for expanding queries
> across fields.
> Using notation that () represents a BooleanQuery and ( | )
> represents a MaxDisjunctionQuery, "albino elephant" expands to:
> ( (title:albino | description:albino)
> (title:elephant | description:elephant) )
> This will recognize that doc2 has both terms matched while doc1 only has 1
> term matched, and so will score doc2 over doc1.
> Refinement note: the actual expansion for "albino elephant" that I use is:
> ( (title:albino | description:albino)~0.1
> (title:elephant | description:elephant)~0.1 )
> This causes the score of each MaxDisjunctionQuery to be the score of the highest
> scoring MDQ subclause plus 0.1 times the sum of the scores of the other MDQ
> subclauses. Thus, doc1 gets some credit for also having "elephant" in the
> description but only 1/10 as much as doc2 gets for covering another query term
> in its description. If doc3 has "elephant" in title and both "albino"
> and "elephant" in the description, then with the actual refined expansion, it
> gets the highest score of all (whereas with pure max, without the 0.1, it
> would get the same score as doc2).
> In real apps, tf's and idf's also come into play of course, but can affect
> these either way (i.e., mitigate this fundamental problem or exacerbate it).
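
The scoring arithmetic described in the quoted issue can be checked with a small standalone sketch. It does not use the Lucene API or the attached patch: the class `DisMaxSketch` and its methods are hypothetical names introduced here, and every term match is assumed to score exactly 1.0 (the "no tf, no idf" setup of step 1). It compares the flat four-clause expansion from problem (b) with the max-plus-tie-breaker expansion for doc1, doc2 and doc3.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not Lucene code and not the attached patch.
class DisMaxSketch {
    // Score of one ( a | b )~tieBreaker clause:
    // the best subclause score plus tieBreaker times the sum of the others.
    static double disMax(double[] subScores, double tieBreaker) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : subScores) {
            sum += s;
            max = Math.max(max, s);
        }
        return max + tieBreaker * (sum - max);
    }

    // Assumption: a match scores 1.0 and a miss 0.0 (tf and idf are ignored).
    static double match(Set<String> fieldTerms, String term) {
        return fieldTerms.contains(term) ? 1.0 : 0.0;
    }

    // Per-term expansion: one max-disjunction across fields per query term, summed.
    static double score(Set<String> title, Set<String> description,
                        String[] queryTerms, double tieBreaker) {
        double total = 0.0;
        for (String term : queryTerms) {
            double[] perField = { match(title, term), match(description, term) };
            total += disMax(perField, tieBreaker);
        }
        return total;
    }

    // Flat expansion from problem (b): every field:term clause simply adds up.
    static double flatScore(Set<String> title, Set<String> description,
                            String[] queryTerms) {
        double total = 0.0;
        for (String term : queryTerms) {
            total += match(title, term) + match(description, term);
        }
        return total;
    }

    static Set<String> terms(String... words) {
        return new HashSet<>(Arrays.asList(words));
    }

    public static void main(String[] args) {
        String[] query = { "albino", "elephant" };
        Set<String> title = terms("elephant");          // same title in all three docs
        Set<String> desc1 = terms("elephant");          // doc1
        Set<String> desc2 = terms("albino");            // doc2
        Set<String> desc3 = terms("albino", "elephant"); // doc3

        System.out.println("flat    doc1=" + flatScore(title, desc1, query)
                + " doc2=" + flatScore(title, desc2, query));
        System.out.println("max     doc1=" + score(title, desc1, query, 0.0)
                + " doc2=" + score(title, desc2, query, 0.0)
                + " doc3=" + score(title, desc3, query, 0.0));
        System.out.println("max~0.1 doc1=" + score(title, desc1, query, 0.1)
                + " doc2=" + score(title, desc2, query, 0.1)
                + " doc3=" + score(title, desc3, query, 0.1));
    }
}
```

Under these assumptions the flat expansion ties doc1 and doc2, pure max ranks doc2 above doc1 but ties doc2 with doc3, and the 0.1 tie-breaker orders them doc3, doc2, doc1, which is the behaviour the issue describes. Real Lucene scores would also include tf, idf and normalization factors that this sketch leaves out.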