Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Date: Tue, 25 Sep 2012 07:50:08 +1100 (NCT)
From: "Mark Miller (JIRA)" <jira@apache.org>
To: dev@lucene.apache.org
Message-ID: <2064114111.118585.1348519808085.JavaMail.jiratomcat@arcas>
In-Reply-To: <765767947.116514.1348494067825.JavaMail.jiratomcat@arcas>
Subject: [jira] [Commented] (SOLR-3875) Document boost does not work
 correctly when using multi-valued fields
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/SOLR-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13462105#comment-13462105 ] 

Mark Miller commented on SOLR-3875:
-----------------------------------

bq. patch with proposed test & fix 

+1

I applied the patch, inspected the fix, inspected the test. It looks right to me.

I also ran all tests, and verified the new test fails as expected without the fix.
                
> Document boost does not work correctly when using multi-valued fields
> ---------------------------------------------------------------------
>
>                 Key: SOLR-3875
>                 URL: https://issues.apache.org/jira/browse/SOLR-3875
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis, update
>    Affects Versions: 4.0-BETA
>            Reporter: Toke Eskildsen
>            Priority: Critical
>             Fix For: 4.0, 4.1, 5.0
>
>         Attachments: SOLR-3875.patch
>
>
> In Solr 4 BETA & trunk, document boosts skew the ranking for documents with multi-valued fields tremendously. A document boost of 5 combined with 15 values in a multi-valued field results in scores above 1,000,000,000, while a boost of 0.5 results in scores below 0.001. The error is not present in Solr 3.6.
> Thomas Egense and I have tracked it down to a change in Solr's DocumentBuilder committed on 2011-08-27 (@1162347) by Mike McCandless, as part of the work done on LUCENE-2308. The problem is that Lucene multiplies the boosts of multiple instances of the same field when updating the index.
> The old DocumentBuilder, used in Solr 3.6, handled this by calculating the boost for the field (docBoost*fieldBoost) and assigning it to the first instance of the field, then assigning a boost of 1.0f to all subsequent instances. This effectively assigned docBoost*fieldBoost to the field, regardless of the number of instances.
> The updated DocumentBuilder (see https://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_4_0/solr/core/src/java/org/apache/solr/update/DocumentBuilder.java?revision=1388778&view=markup), used in Solr 4 BETA & trunk, also assigns docBoost*fieldBoost to the first instance of the field. It then sets fieldBoost = docBoost and continues to assign docBoost*fieldBoost to subsequent instances. Using the example mentioned above, the generated IndexableFields are assigned boosts of 5, 5*5, 5*5, ..., 5*5. As Lucene multiplies all the values, 15 instances of the same field will have a collective boost of 5*25^14.
> This can be demonstrated with the Solr tutorial example by indexing the sample documents and adding the following document:
> {code:xml}
> <add>
> <doc boost="5">
>   <field name="id">Insane score Example. Score = 10E9 </field>
>   <field name="name">Document boost broken for multivalued fields</field>
>   <field name="manu">Thomas Egense and Toke Eskildsen</field>
>   <field name="manu_id_s">Test</field>
>   <field name="cat">bug</field>
>   <field name="features">insane_boost</field>
>   <field name="features">something else</field>
>   <field name="features">something else</field>
>   <field name="features">something else</field>
>   <field name="features">something else</field>
>   <field name="features">something else</field>
>   <field name="features">something else</field>
>   <field name="features">something else</field>
>   <field name="features">something else</field>
>   <field name="features">something else</field>
>   <field name="features">something else</field>
>   <field name="features">something else</field>
>   <field name="features">something else</field>
>   <field name="features">something else</field>  
> </doc>
> </add>
> {code}
> The _manu_ and _features_ fields get copied to _text_, and a search for _thomas_ matches the _text_ field with the following query explanation:
> {code:xml}
> <str name="Insane score Example. Score = 10E10 ">
> 2.44373361E10 = (MATCH) weight(text:thomas in 0) [DefaultSimilarity], result of:
>   2.44373361E10 = fieldWeight in 0, product of:
>     1.0 = tf(freq=1.0), with freq of:
>       1.0 = termFreq=1.0
>     3.2512918 = idf(docFreq=3, maxDocs=38)
>     7.5161928E9 = fieldNorm(doc=0)
> </str>
> {code}
> Thomas and I are too pressed for time to attempt a proper patch at the moment, but we suspect that reverting to the old algorithm, which assigns the combined boost to the first instance and 1.0f to all subsequent instances, would work.
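
For reference, the boost arithmetic described in the quoted report can be sketched as below. This is a minimal Python model of the description only, not Solr's DocumentBuilder code: the function names are made up for illustration, and the initial fieldBoost of 1.0 is an assumption (the report does not state it, but it is consistent with the 5, 5*5, 5*5... sequence). The last helper simply recomputes the fieldWeight product from the explain output above.

```python
import math


def old_instance_boosts(doc_boost, field_boost, n_values):
    """3.6-style scheme as described in the report: the combined boost goes
    on the first instance of the field, 1.0 on every later instance."""
    if n_values <= 0:
        return []
    return [doc_boost * field_boost] + [1.0] * (n_values - 1)


def buggy_instance_boosts(doc_boost, field_boost, n_values):
    """4.0-BETA scheme as described in the report: after the first instance,
    fieldBoost is overwritten with docBoost, so later instances get
    docBoost*docBoost instead of 1.0."""
    boosts = []
    for _ in range(n_values):
        boosts.append(doc_boost * field_boost)
        field_boost = doc_boost  # the step the report identifies as the problem
    return boosts


def collective_boost(boosts):
    """Lucene multiplies the boosts of all instances of the same field."""
    result = 1.0
    for b in boosts:
        result *= b
    return result


def field_weight(tf, idf, field_norm):
    """fieldWeight as shown in the explain output: tf * idf * fieldNorm."""
    return tf * idf * field_norm


if __name__ == "__main__":
    # docBoost=5, assumed fieldBoost=1.0, 15 instances of the same field
    print(collective_boost(old_instance_boosts(5.0, 1.0, 15)))
    print(collective_boost(buggy_instance_boosts(5.0, 1.0, 15)))
    print(collective_boost(buggy_instance_boosts(0.5, 1.0, 15)))
    print(field_weight(1.0, 3.2512918, 7.5161928e9))
```

With 15 instances the old scheme yields a collective boost equal to docBoost, while the described 4.0-BETA scheme yields 5*25^14 for a boost of 5 and 0.5*0.25^14 for a boost of 0.5, which is the skew in both directions that the report mentions.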
