lucene-solr-user mailing list archives

From Sravan Kumar <sra...@caavo.com>
Subject Re: Title Search scoring issues with multivalued field & norm
Date Thu, 01 Feb 2018 02:38:59 GMT

@Walter: We have 6 title fields declared in schema.xml, each with a different type of analyzer:
one that leaves symbols unprocessed, one that stems, one that removes symbols, etc. So, if we
add separate fields for each alias, the number of fields declared in schema.xml gets multiplied
by the number of aliases. And we do not know exactly what the maximum number of aliases a
movie can have is.
@Walter: I will try this, but isn’t there any other way I can tweak this?

@Erick: I will try this, but it will work only for exact matches.
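
For reference, here is a rough sketch of how the phrase boost suggested in the quoted reply below could be sent together with edismax's qf & pf parameters. The field names (title_exact, title_stemmed) and the weights are hypothetical, since the actual six title fields are not listed in this thread. Only the defType, q, qf and pf request parameters are standard Solr parameters.

```python
from urllib.parse import urlencode


def build_title_query(user_query, phrase_boost=5):
    """Build edismax request parameters with an explicit boosted phrase clause.

    The field names and weights below are placeholders (assumptions),
    not the configuration used in this thread.
    """
    # Drop double quotes so the user input cannot break the phrase clause.
    cleaned = user_query.replace('"', "").strip()
    return {
        "defType": "edismax",
        # Plain terms for matching, plus the whole query as a boosted phrase.
        "q": f'{cleaned} "{cleaned}"^{phrase_boost}',
        "qf": "title_exact^3 title_stemmed^1",  # hypothetical fields/weights
        "pf": "title_exact^5",                  # hypothetical field/weight
    }


params = build_title_query("beauty and the beast")
query_string = urlencode(params)
print(query_string)
```

As noted above, the phrase clause only rewards documents that contain the whole phrase, so it does not by itself separate "Beauty and the Beast" from "The Real Beauty and the Beast".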


> On Jan 31, 2018, at 10:39 PM, Erick Erickson <erickerickson@gmail.com> wrote:
> 
> Or use a boost for the phrase, something like
> "beauty and the beast"^5
> 
>> On Wed, Jan 31, 2018 at 8:43 AM, Walter Underwood <wunder@wunderwood.org> wrote:
>> You can use a separate field for title aliases. That is what I did for Netflix search.
>> 
>> Why disable idf? Disabling tf for titles can be a good idea, for example the movie
>> “New York, New York” is not twice as much about New York as some other film that just
>> lists it once.
>> 
>> Also, consider using a popularity score as a boost.
>> 
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Jan 31, 2018, at 4:38 AM, Sravan Kumar <sravan@caavo.com> wrote:
>>> 
>>> Hi,
>>> We are using Solr for our movie title search.
>>> 
>>> 
>>> As it is "title search", this should be treated differently from normal
>>> document search.
>>> Hence, we use a modified version of TFIDFSimilarity with the following
>>> changes.
>>> -  disabled TF & IDF, so both always have the value 1.
>>> -  disabled norms by specifying omitNorms as true for all the fields.
>>> 
>>> There are 6 fields with different analyzers and we make use of different
>>> weights in edismax's qf & pf parameters to match tokens & boost phrases.
>>> 
>>> But, movies could have aliases and have multiple titles. So, we made the
>>> fields multivalued.
>>> 
>>> Now, consider the following four documents
>>> 1>  "Beauty and the Beast"
>>> 2>  "The Real Beauty and the Beast"
>>> 3>  "Beauty and the Beast", "La bella y la bestia"
>>> 4>  "Beauty and the Beast"
>>> 
>>> Note: Document 3 has two titles in it.
>>> 
>>> So, for a query "Beauty and the Beast" with the above configuration, all
>>> the documents receive the same score. But documents 1, 3 and 4 should get the
>>> same score, and document 2 a lower score than the others.
>>> 
>>> To solve this, we followed what is suggested in the following thread:
>>> http://lucene.472066.n3.nabble.com/Influencing-scores-on-values-in-multiValue-fields-td1791651.html
>>> 
>>> Now, the fields which are used for boosting are made to use norms, and
>>> norms are disabled for the fields used for matching. This is to make sure
>>> that exact & near exact matches are rewarded.
>>> 
>>> But, for the same query, we get the following results.
>>> query: "Beauty & the Beast"
>>> Search Results:
>>> 1>  "Beauty and the Beast"
>>> 4>  "Beauty and the Beast"
>>> 2>  "The Real Beauty and the Beast"
>>> 3>  "Beauty and the Beast", "La bella y la bestia"
>>> 
>>> Clearly, the changes have solved only a part of the problem. Document 3
>>> should be ranked/scored higher than document 2.
>>> 
>>> This is because Lucene considers the total field length across all the
>>> values in a multivalued field for normalization.
>>> 
>>> How do we handle this scenario and make sure that in multivalued fields the
>>> normalization is taken care of?
>>> 
>>> 
>>> --
>>> Regards,
>>> Sravan
>> 
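
The ranking reported in the original message can be reproduced with a small sketch of length normalization. It assumes the classic TF-IDF style norm of 1/sqrt(number of terms) and plain whitespace tokenization with no stopword removal; both are simplifications. Real Lucene also stores norms in a lossy encoded form, so the numbers here are only indicative. The function names are made up for this illustration.

```python
import math


def length_norm(num_terms):
    # Classic TF-IDF style length norm (assumed here): 1 / sqrt(term count).
    return 1.0 / math.sqrt(num_terms)


def multivalued_norm(values):
    # Tokens of ALL values in the field are counted together,
    # which is the behaviour described in the original message.
    total_terms = sum(len(value.split()) for value in values)
    return length_norm(total_terms)


docs = {
    1: ["Beauty and the Beast"],
    2: ["The Real Beauty and the Beast"],
    3: ["Beauty and the Beast", "La bella y la bestia"],
    4: ["Beauty and the Beast"],
}

norms = {doc_id: multivalued_norm(values) for doc_id, values in docs.items()}

# Higher norm ranks first; ties are broken by document id.
ranking = sorted(docs, key=lambda doc_id: (-norms[doc_id], doc_id))
print(ranking)

# With a separate alias field (as suggested in the thread), only the
# primary title contributes to the norm of the title field.
primary_only_norm_doc3 = multivalued_norm(docs[3][:1])
print(norms[3], primary_only_norm_doc3)
```

Under these assumptions, document 3 is normalized over 9 terms (4 + 5) while document 2 is normalized over 6, which is why document 3 falls below document 2. Keeping the aliases in a separate field gives document 3's primary title the same norm as documents 1 and 4.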
