lucene-dev mailing list archives

From "Toke Eskildsen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-3875) Document boost does not work correctly when using multi-valued fields
Date Tue, 23 Oct 2012 09:43:12 GMT

    [ https://issues.apache.org/jira/browse/SOLR-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482233#comment-13482233 ]

Toke Eskildsen commented on SOLR-3875:
--------------------------------------

Unfortunately, the bug is only partly solved. Thomas and I encountered strange scores again.
While boosting of multi-valued fields is handled correctly in Solr 4.0.0, boosting for copyFields
is not. A sample document:

{code}
<add><doc boost="10.0">
  <field name="id">Insane score Example. Score = 10E9 </field>
  <field name="name">Document boost broken for copyFields</field>
  <field name="manu">video ThomasEgense and Toke Eskildsen</field>
  <field name="manu_id_s">Test</field>
  <field name="cat">bug</field>
  <field name="features">something else</field>
  <field name="keywords">bug</field>
  <field name="content">bug</field>
</doc></add>
{code}

The fields _name_, _manu_, _cat_, _features_, _keywords_ and _content_ are copied to _text_, and
a search for _thomasegense_ matches the _text_ field with this query explanation:

{code}
70384.67 = (MATCH) weight(text:thomasegense in 0) [DefaultSimilarity], result of:
  70384.67 = fieldWeight in 0, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    0.30685282 = idf(docFreq=1, maxDocs=1)
    229376.0 = fieldNorm(doc=0)
{code}

If the last two fields, _keywords_ and _content_, are removed from the sample document, the
score is reduced by a factor of 100 (docBoost^2).

The current DocumentBuilder https://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_4_0/solr/core/src/java/org/apache/solr/update/DocumentBuilder.java?revision=1389648&view=markup
works roughly like this:

{code}
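foreach (field) {
  boost = docBoost*fieldBoost
  foreach (value) {
    // the first value of each field carries the full boost
    assignField(field, value, boost)
    foreach (copyField) {
      // every copyField target receives that same boost
      assignField(copyField, value, boost)
    }
    // subsequent values of this field get a neutral boost
    boost = 1f
  }
}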
{code}

When all fields share the same copyField (_text_ in this example), the copyField is assigned
the full boost once for each directly specified field that uses it. Since Lucene multiplies
the boosts of all instances of a field, these boosts compound. That's
5 times with the provided sample, so the total boost for the field _text_ will be 10^5.
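
The compounding can be reproduced with a small toy model. This is only a sketch and not Solr code: the class _CopyFieldBoostSketch_, its _InputField_ helper and the method _collectiveBoosts_ are hypothetical names, all field boosts are assumed to be 1.0, and the multiplication of instance boosts is modelled as described in the issue.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the boost assignment described above; not Solr code. */
final class CopyFieldBoostSketch {

    /** Hypothetical stand-in for one directly specified field and its copyField targets. */
    static final class InputField {
        final String name;
        final List<String> values;
        final List<String> copyTargets;

        InputField(String name, List<String> values, List<String> copyTargets) {
            this.name = name;
            this.values = values;
            this.copyTargets = copyTargets;
        }
    }

    /**
     * Returns the collective boost per index field, assuming (as the issue
     * states) that the boosts of all instances of a field are multiplied.
     * Field boosts are assumed to be 1.0.
     */
    static Map<String, Double> collectiveBoosts(List<InputField> fields, double docBoost) {
        Map<String, Double> collective = new LinkedHashMap<>();
        for (InputField field : fields) {
            double boost = docBoost; // docBoost * fieldBoost, with fieldBoost assumed 1.0
            for (String value : field.values) {
                collective.merge(field.name, boost, (a, b) -> a * b);
                for (String target : field.copyTargets) {
                    // Each source field hands its full boost to the shared target.
                    collective.merge(target, boost, (a, b) -> a * b);
                }
                boost = 1.0; // later values of the same source field are neutral
            }
        }
        return collective;
    }
}
```

In this model, five single-valued fields that all copy to _text_ with a document boost of 10 leave each source field with a boost of 10 but give _text_ a collective boost of 10^5.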

One solution would be to keep track of used fields (directly specified as well as copyFields)
and only assign the full boost once per field per document. If the number of unique fields per
document is low, a simple list would probably be fastest and have low GC impact. For a higher
number of unique fields, a Set might be better. An optimization would be to only create the tracking
structure once a boost != 1.0f is encountered and only store the fields with boost != 1.0f,
so that an update without boosts would not get a performance penalty.
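
A minimal sketch of that idea, again as a toy model and not as a patch: _BoostOncePerFieldSketch_, _InputField_ and _collectiveBoosts_ are hypothetical names, field boosts are assumed to be 1.0, and a HashSet is used for tracking although a short list might do for few fields.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of "assign the full boost only once per field"; not Solr code. */
final class BoostOncePerFieldSketch {

    /** Hypothetical stand-in for one directly specified field and its copyField targets. */
    static final class InputField {
        final String name;
        final List<String> values;
        final List<String> copyTargets;

        InputField(String name, List<String> values, List<String> copyTargets) {
            this.name = name;
            this.values = values;
            this.copyTargets = copyTargets;
        }
    }

    /** Returns the collective boost per index field; field boosts are assumed to be 1.0. */
    static Map<String, Double> collectiveBoosts(List<InputField> fields, double docBoost) {
        Map<String, Double> collective = new LinkedHashMap<>();
        Set<String> alreadyBoosted = null; // only created if a boost != 1.0 shows up
        for (InputField field : fields) {
            // The field itself plus all of its copyField targets.
            List<String> targets = new ArrayList<>(field.copyTargets);
            targets.add(0, field.name);
            for (String value : field.values) {
                for (String target : targets) {
                    double boost = 1.0;
                    if (docBoost != 1.0) {
                        if (alreadyBoosted == null) {
                            alreadyBoosted = new HashSet<>();
                        }
                        // add() returns true only the first time a field is seen.
                        if (alreadyBoosted.add(target)) {
                            boost = docBoost;
                        }
                    }
                    collective.merge(target, boost, (a, b) -> a * b);
                }
            }
        }
        return collective;
    }
}
```

With five fields copying to _text_ and a document boost of 10, every field in this model, including _text_, ends up with a collective boost of 10. How differing field boosts from several source fields should combine on a shared copyField is left open here.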
                
> Document boost does not work correctly when using multi-valued fields
> ---------------------------------------------------------------------
>
>                 Key: SOLR-3875
>                 URL: https://issues.apache.org/jira/browse/SOLR-3875
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis, update
>    Affects Versions: 4.0-BETA
>            Reporter: Toke Eskildsen
>            Assignee: Hoss Man
>            Priority: Critical
>             Fix For: 4.0, 4.1, 5.0
>
>         Attachments: SOLR-3875.patch
>
>
> In Solr 4 BETA & trunk, document boosts skew the ranking for documents with multi-valued
fields tremendously. A document boost of 5 combined with 15 values in a multi-valued
field results in scores above 1,000,000,000, while a boost of 0.5 results in scores below
0.001. The error is not present in Solr 3.6.
> Thomas Egense and I have tracked it down to a change in Solr DocumentBuilder committed
20110827 (@1162347) by Mike McCandless, as part of work done on LUCENE-2308. The problem is
that Lucene multiplies the boosts of multiple instances of the same field when updating the
index.
> The old DocumentBuilder, used in Lucene 3.6, handled this by calculating the score for
the field (docBoost*fieldBoost) and assigning it to the first instance of the field, then
setting the boost to 1.0f and assigning that to subsequent instances of the field. This effectively
assigned docBoost*fieldBoost to the field, regardless of the number of instances.
> The updated DocumentBuilder (see https://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_4_0/solr/core/src/java/org/apache/solr/update/DocumentBuilder.java?revision=1388778&view=markup),
used in Lucene 4 BETA & trunk, also assigns docBoost*fieldBoost to the first instance
of the field. Then it sets fieldBoost = docBoost and continues to assign docBoost*fieldBoost
to subsequent instances. Using the example mentioned above, the generated IndexableFields
will get assigned boosts of 5, 5*5, 5*5... 5*5. As Lucene multiplies all the values, 15 instances
of the same field will have a collective boost of 5*25^14.
> This can be demonstrated with the Solr tutorial example by indexing the sample documents
and adding the document 
> {code:xml}
> <add>
> <doc boost="5">
>   <field name="id">Insane score Example. Score = 10E9 </field>
>   <field name="name">Document boost broken for multivalued fields</field>
>   <field name="manu">Thomas Egense and Toke Eskildsen</field>
>   <field name="manu_id_s">Test</field>
>   <field name="cat">bug</field>
>   <field name="features">insane_boost</field>
>   <field name="features">something else</field>
>   <field name="features">something else</field>
>   <field name="features">something else</field>
>   <field name="features">something else</field>
>   <field name="features">something else</field>
>   <field name="features">something else</field>
>   <field name="features">something else</field>
>   <field name="features">something else</field>
>   <field name="features">something else</field>
>   <field name="features">something else</field>
>   <field name="features">something else</field>
>   <field name="features">something else</field>
>   <field name="features">something else</field>  
> </doc>
> </add>
> {code}
> The _manu_ and _features_ fields are copied to _text_, and a search for _thomas_ matches
the _text_ field with this query explanation:
> {code:xml}
> <str name="Insane score Example. Score = 10E10 ">
> 2.44373361E10 = (MATCH) weight(text:thomas in 0) [DefaultSimilarity], result of:
>   2.44373361E10 = fieldWeight in 0, product of:
>     1.0 = tf(freq=1.0), with freq of:
>       1.0 = termFreq=1.0
>     3.2512918 = idf(docFreq=3, maxDocs=38)
>     7.5161928E9 = fieldNorm(doc=0)
> </str>
> {code}
> Thomas and I are too pressed for time to attempt a proper patch at the moment, but we
guess that a reversion to the old algorithm of assigning the combined boost to the first instance
and 1.0f to all subsequent instances would work?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

