Date: Tue, 25 Sep 2012 07:50:08 +1100 (NCT)
From: "Mark Miller (JIRA)"
To: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org
Message-ID: <2064114111.118585.1348519808085.JavaMail.jiratomcat@arcas>
In-Reply-To: <765767947.116514.1348494067825.JavaMail.jiratomcat@arcas>
Subject: [jira] [Commented] (SOLR-3875) Document boost does not work correctly when using multi-valued fields

    [ https://issues.apache.org/jira/browse/SOLR-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13462105#comment-13462105 ]

Mark Miller commented on SOLR-3875:
-----------------------------------

bq. patch with proposed test & fix

+1

I applied the patch, inspected the fix, inspected the test. It looks right to me. I also ran all tests, and verified the new test fails as expected without the fix.

> Document boost does not work correctly when using multi-valued fields
> ---------------------------------------------------------------------
>
>                 Key: SOLR-3875
>                 URL: https://issues.apache.org/jira/browse/SOLR-3875
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis, update
>    Affects Versions: 4.0-BETA
>            Reporter: Toke Eskildsen
>            Priority: Critical
>             Fix For: 4.0, 4.1, 5.0
>
>         Attachments: SOLR-3875.patch
>
>
> In Solr 4 BETA & trunk, document boosts skew the ranking for documents with multi-valued fields tremendously. A document boost of 5 combined with 15 values in a multi-valued field results in scores above 1,000,000,000, while a boost of 0.5 results in scores below 0.001. The error is not present in Solr 3.6.
>
> Thomas Egense and I have tracked it down to a change in Solr's DocumentBuilder committed 20110827 (@1162347) by Mike McCandless, as part of work done on LUCENE-2308. The problem is that Lucene multiplies the boosts of multiple instances of the same field when updating the index.
>
> The old DocumentBuilder, used in Solr 3.6, handled this by calculating the boost for the field (docBoost*fieldBoost) and assigning it to the first instance of the field, then setting the boost to 1.0f and assigning that to subsequent instances of the field. This effectively assigned docBoost*fieldBoost to the field, regardless of the number of instances.
>
> The updated DocumentBuilder (see https://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_4_0/solr/core/src/java/org/apache/solr/update/DocumentBuilder.java?revision=1388778&view=markup), used in Solr 4 BETA & trunk, also assigns docBoost*fieldBoost to the first instance of the field. Then it sets fieldBoost = docBoost and continues to assign docBoost*fieldBoost to subsequent instances. Using the example mentioned above, the generated IndexableFields will get assigned boosts of 5, 5*5, 5*5... 5*5. As Lucene multiplies all the values, 15 instances of the same field will have a collective boost of 5*25^14.
>
> This can be demonstrated with the Solr tutorial example by indexing the sample documents and adding the document
> {code:xml}
>
>
> Insane score Example. Score = 10E9
> Document boost broken for multivalued fields
> Thomas Egense and Toke Eskildsen
> Test
> bug
> insane_boost
> something else
> something else
> something else
> something else
> something else
> something else
> something else
> something else
> something else
> something else
> something else
> something else
> something else
>
>
> {code}
> The _manu_ and _features_ fields get copied to _text_, and a search for _thomas_ matches the _text_ field with the query explanation
> {code:xml}
>
> 2.44373361E10 = (MATCH) weight(text:thomas in 0) [DefaultSimilarity], result of:
>   2.44373361E10 = fieldWeight in 0, product of:
>     1.0 = tf(freq=1.0), with freq of:
>       1.0 = termFreq=1.0
>     3.2512918 = idf(docFreq=3, maxDocs=38)
>     7.5161928E9 = fieldNorm(doc=0)
>
> {code}
> Thomas and I are too pressed for time to attempt a proper patch at the moment, but we guess that a reversion to the old algorithm of assigning the combined boost to the first instance and 1.0f to all subsequent instances would work?
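
The boost arithmetic in the issue description can be checked with a small sketch. This is not the DocumentBuilder code, only a model of the two assignment schemes as the report describes them; the function names are hypothetical, and it assumes an initial fieldBoost of 1 and that the per-instance boosts are simply multiplied together.

```python
def boosts_first_instance_only(doc_boost, field_boost, instances):
    """Scheme the report attributes to 3.6: the combined boost goes on the
    first instance and every later instance gets a neutral boost of 1."""
    return [doc_boost * field_boost] + [1] * (instances - 1)


def boosts_as_reported_for_4_beta(doc_boost, field_boost, instances):
    """Scheme the report attributes to 4.0-BETA: after the first instance,
    fieldBoost is overwritten with docBoost, so each later instance is
    assigned docBoost * docBoost."""
    boosts = []
    for _ in range(instances):
        boosts.append(doc_boost * field_boost)
        field_boost = doc_boost  # the step the report identifies as the bug
    return boosts


def collective_boost(boosts):
    """Multiply the per-instance boosts, as the report says Lucene does."""
    total = 1
    for boost in boosts:
        total *= boost
    return total


if __name__ == "__main__":
    for doc_boost in (5, 0.5):
        old = collective_boost(boosts_first_instance_only(doc_boost, 1, 15))
        new = collective_boost(boosts_as_reported_for_4_beta(doc_boost, 1, 15))
        print(f"docBoost={doc_boost}: first-instance-only={old}, as-reported={new}")
```

Under these assumptions, a document boost of 5 over 15 instances stays at 5 with the first-instance-only scheme and grows to 5*25^14 (about 1.86e20) with the reported one, while a boost of 0.5 shrinks to 0.5*0.25^14.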