lucene-java-user mailing list archives

From Paul Taylor <paul_t...@fastmail.fm>
Subject Re: Why does index boosting a field to 2.0f on a document have such a dramatic effect
Date Thu, 04 Apr 2013 22:45:46 GMT
On 04/04/2013 23:26, Chris Hostetter wrote:
> : At index time I boost the alias field of a small set of documents, setting the
> : boost to 2.0f, which I thought meant equivalent to doubling the score this doc
> : would get over another doc, everything else being equal.
>
> 1) you haven't shown us enough details to be certain, but based on the
> code you've provided it looks like you are adding a boost for *each* field
> instance named "alias" if the value of artistGuid is in your
> artistGuIdSet...
>
> :         if(artistGuIdSet.contains(artistGuid)) {
> :             for(IndexableField indexablefield:doc.getFields())
> :             {
> : if(indexablefield.name().equals(ArtistIndexField.ALIAS.getName()))
> :                 {
> :                     Field field = (Field)indexablefield;
> :                     field.setBoost(ARTIST_DOC_BOOST);
>
> ...so a doc with N values in the "alias" field is going to get a field
> boost of N*2.
I was converting a document boost from Lucene 3 code. For a particular
document I only call setBoost() once. However, the problem artists do
have a number of aliases. I thought that when you add multiple values
independently to one field it is still treated as one field, but is Lucene
4 now treating them as separate fields, so that I end up calling
field.setBoost() for each alias I have added to the alias field?
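
To make the question concrete, here is a minimal sketch of the suspected effect. It does not use Lucene at all. It assumes that the boosts of all same-named field instances in a document are multiplied together at index time, and the helper names combinedBoost and boostFirstOnly are hypothetical, introduced only for illustration.

```java
// Sketch only: models how index-time boosts on same-named field instances
// are ASSUMED to combine (multiplied together). No Lucene classes are used.
class BoostCompoundingSketch {

    // Hypothetical helper: combined boost when every instance is given perInstanceBoost.
    static float combinedBoost(float perInstanceBoost, int instances) {
        float boost = 1.0f;
        for (int i = 0; i < instances; i++) {
            boost *= perInstanceBoost;
        }
        return boost;
    }

    // Hypothetical helper: only the first instance is boosted, the rest stay at 1.0f.
    static float boostFirstOnly(float boost, int instances) {
        return instances > 0 ? boost : 1.0f;
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 3, 10, 33}) {
            System.out.println(n + " aliases: every instance boosted = " + combinedBoost(2.0f, n)
                    + ", first instance only = " + boostFirstOnly(2.0f, n));
        }
    }
}
```

If that assumption holds, calling setBoost() on only one alias instance per document would keep the effective field boost at 2.0f regardless of how many aliases the artist has.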
> 2) Looking at the URL you mentioned
>
> : http://search.musicbrainz.org/?type=artist&query=Jean&explain=true
>
> ...the debug explanation currently produced by that URL says...
>
> 6.4894321E10 = (MATCH) weight(alias:jean in 7610) [MusicbrainzSimilarity], result of:
>     ...
>     7.5161928E9 = fieldNorm(doc=7610)
>
> You need to look at your "MusicbrainzSimilarity" class and its fieldNorm
> method to determine for certain why it's producing such large values.  we
> have no idea how that's implemented.
The MusicbrainzSimilarity class aims to solve another issue with
aliases: a field with many aliases is at a disadvantage in scoring
compared with one with few aliases. I don't think I'm doing anything silly
regarding the boost here, am I?

package org.musicbrainz.search.analysis;

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.index.Norm;
import org.apache.lucene.search.similarities.DefaultSimilarity;

/**
  * Calculates a score for a match, overridden to deal with problems with alias fields in
  * artist and label indexes
  */
//TODO in Lucene 4.1 we can now use PerFieldSimilarityWrapper so that we only perform this
//on fields that need it, with
//current code tf() is performed on every field because we are not passed fieldname


public class MusicbrainzSimilarity extends DefaultSimilarity
{
    /**
      * Calculates a value which is inversely proportional to the number of terms in the field.
      * When multiple aliases are added to an artist (or label) it is seen as one field, so
      * artists with many aliases can be disadvantaged against when the matching alias is
      * radically different to other aliases.
      *
      * @param state
      * @return
      */
     @Override
     public float lengthNorm(FieldInvertState state) {

         if (state.getName().equals("alias"))
         {
             if(state.getLength()>=3) {
                 // Same result as normal calc if field had three terms,
                 // the most common scenario
                 return state.getBoost() * 0.578f;
             }
             else
             {
                 return super.lengthNorm(state);
             }
         }
         else
         {
             return super.lengthNorm(state);
         }
     }

     /**
      * This method calculates a value based on how many times the search term was found in
      * the field. Because we have only short fields, the only real case (apart from rare
      * exceptions like Duran Duran Duran) whereby the term is found more than twice would be
      * when a search term matches multiple aliases. To remove the bias this gives towards
      * artists/labels with many aliases we limit the value to what would be returned for a
      * two term match.
      *
      * Note: would prefer to do this just for alias field, but the field is not passed as
      * a parameter.
      * @param freq
      * @return score component
      */
     @Override
     public float tf(float freq) {
         if (freq > 2.0f) {
             return 1.41f; //Same result as if matched term twice

         } else {
             return super.tf(freq);
         }
     }
}
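
For reference, the two hard-coded constants above can be checked with a small standalone sketch. It assumes the default formulas are boost / sqrt(number of terms) for the length norm and sqrt(freq) for tf. The class and method names are hypothetical and nothing here calls Lucene.

```java
// Sketch only: reproduces the arithmetic behind 0.578f and 1.41f without Lucene.
class SimilarityConstantsSketch {

    // ASSUMED default length norm: boost * 1 / sqrt(number of terms in the field).
    static float defaultLengthNorm(float boost, int numTerms) {
        return boost * (float) (1.0 / Math.sqrt(numTerms));
    }

    // ASSUMED default term-frequency factor: sqrt(freq).
    static float defaultTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    public static void main(String[] args) {
        // Compare with the 0.578f used in lengthNorm() for fields of three or more terms.
        System.out.println("lengthNorm, 3 terms, boost 1.0: " + defaultLengthNorm(1.0f, 3));
        // Compare with the 1.41f cap used in tf() when freq > 2.
        System.out.println("tf for freq 2.0: " + defaultTf(2.0f));
    }
}
```

Note that the length norm still scales linearly with state.getBoost(), so whatever boost has accumulated on the alias field passes straight through into the stored norm.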

>
> -Hoss
>

