lucene-solr-user mailing list archives

From Sravan Kumar <sra...@caavo.com>
Subject Re: Title Search scoring issues with multivalued field & norm
Date Sun, 04 Feb 2018 11:41:38 GMT
Using edismax with a different field for each title will affect the final
scores if the tie parameter is non-zero.
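
A rough sketch of why the tie matters, assuming the usual disjunction-max
combination (best field score plus tie times the sum of the other matching
field scores). The per-field scores below are made-up numbers for
illustration, not output from Solr:

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores the way a disjunction-max query does:
    the best field score plus `tie` times the sum of the remaining ones."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


# Hypothetical per-field scores for two movies, one field per title.
single_title = [1.0]        # only the main title field matches
with_alias = [1.0, 0.4]     # main title matches, an alias field matches partially

for tie in (0.0, 0.1):
    print(tie, dismax_score(single_title, tie), dismax_score(with_alias, tie))
```

With tie=0 both movies score the same; with any non-zero tie the movie with
the extra alias field pulls ahead, even though its best title matches no
better.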

Can we create a separate document for each title? Uniqueness would then be
per title rather than per movie_id. That way, even with edismax, the other
titles of a movie won't affect the score.
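
A minimal sketch of that idea, under assumed names: the fields `id`,
`movie_id`, `title`, and `score`, and both helper functions, are hypothetical
and not taken from our schema. It expands one movie into one document per
title and then keeps only the best-scoring hit per movie:

```python
def per_title_docs(movie):
    """Expand one movie record into one index document per title.
    The document id combines movie_id with the title's position."""
    return [
        {
            "id": f"{movie['movie_id']}_{i}",
            "movie_id": movie["movie_id"],
            "title": title,
        }
        for i, title in enumerate(movie["titles"])
    ]


def best_per_movie(hits):
    """Keep the highest-scoring hit for each movie_id, best score first."""
    best = {}
    for hit in hits:
        current = best.get(hit["movie_id"])
        if current is None or hit["score"] > current["score"]:
            best[hit["movie_id"]] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)


movie = {"movie_id": "3",
         "titles": ["Beauty and the Beast", "La bella y la bestia"]}
for doc in per_title_docs(movie):
    print(doc)
```

The de-duplication step is done client-side here only to keep the sketch
self-contained. On the Solr side, result grouping or the collapse query
parser on movie_id would be the natural place to do it.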

Is there any other way to handle norms in a multivalued field?
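
For reference, the arithmetic behind the problem, as a sketch. It assumes the
classic length norm of 1/sqrt(number of tokens), plain whitespace
tokenization, and no stopword removal, and it ignores the lossy one-byte
encoding Lucene applies to stored norms. `per_value_norm` is a hypothetical
helper showing the behaviour we would like, not something Lucene provides:

```python
import math


def length_norm(num_tokens):
    """Classic TF-IDF style length norm: shorter fields get a larger norm."""
    return 1.0 / math.sqrt(num_tokens)


def field_length(values):
    """Total token count across all values of a multivalued field."""
    return sum(len(value.split()) for value in values)


def per_value_norm(values):
    """Hypothetical alternative: norm of the shortest single value."""
    return max(length_norm(len(value.split())) for value in values)


docs = {
    1: ["Beauty and the Beast"],
    2: ["The Real Beauty and the Beast"],
    3: ["Beauty and the Beast", "La bella y la bestia"],
}

for doc_id, titles in docs.items():
    total = field_length(titles)
    print(doc_id, total, round(length_norm(total), 3),
          round(per_value_norm(titles), 3))
```

Document 3 has 4 + 5 = 9 tokens in total against 6 for document 2, so its
norm is smaller and it ranks below document 2, even though one of its titles
is an exact 4-token match.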

On Thu, Feb 1, 2018 at 12:24 PM, Sravan Kumar <sravan@caavo.com> wrote:

> @Walter: Perhaps you are right on not to consider stemming. Instead fuzzy
> search will cover these along with the misspellings.
>
> In case of symbols, we want the titles matching the symbols ranked higher
> than the others. Perhaps we can use this field only for boosting.
>
> Certain movies have around 4-6 different aliases based on what our source
> gives and we do not really know what is the max. Is there no other way from
> lucene/solr to use a multivalued field?
>
>
> On Thu, Feb 1, 2018 at 11:06 AM, Walter Underwood <wunder@wunderwood.org>
> wrote:
>
>> I was the first search engineer at Netflix and moved their search from a
>> home-grown engine to Solr. It worked very well with a single title field
>> and aliases.
>>
>> I think your schema is too complicated for movie search.
>>
>> Stemming is not useful. It doesn’t help search and it can hurt. You don’t
>> want the movie “Saw” to match the query “see”.
>>
>> When is it useful to search with symbols? Remove the punctuation.
>>
>> The only movie titles with symbols that caused any challenge were:
>>
>> * Frost/Nixon
>> * .hack//Sign
>> * +/-
>>
>> For the first two, removing punctuation worked fine. For the last one, I
>> hardcoded a translation to “plus/minus” before indexing or querying.
>>
>> Query completion made a huge difference, taking our clickthrough rate
>> from 0.45 to 0.55.
>>
>> Later, we added fuzzy search to handle misspellings.
>>
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>> > On Jan 31, 2018, at 8:54 PM, Sravan Kumar <sravan@caavo.com> wrote:
>> >
>> > @Tim Casey: Yeah... TFIDFSimilarity weighs towards shorter documents. This
>> > is done through the fieldnorm component in the class. The issue is when the
>> > field is multivalued. Consider the field has two string each of 4 tokens.
>> > The fieldNorm from the lucene TFIDFSimilarity class considers the total sum
>> > of these two values i.e 8 for normalizing instead of 4. Hence, the ranking
>> > is distorted.
>> > Regarding the search evaluation, we do have a curated set.
>> >
>> >
>> > On Thu, Feb 1, 2018 at 9:18 AM, Tim Casey <tcasey@gmail.com> wrote:
>> >
>> >> For smaller length documents TFIDFSimilarity will weight towards shorter
>> >> documents.  Another way to say this, if your documents are 5-10 terms, the
>> >> 5 terms are going to win.
>> >> You might think about having per token, or token pair, weight.  I would be
>> >> surprised if there was not something similar out there.  This is a common
>> >> issue with any short text.
>> >> I guess I would think of this as TFICF, where the CF is the corpus
>> >> frequency. You also might want to weight inversely proportional to the age
>> >> of the title, older are less important.  This is assuming people are doing
>> >> searches within some time cluster, newer is more likely.
>> >>
>> >> For some obvious advice, things you probably already know.  This kind of
>> >> search needs some hard measurement to begin to know how to tune it.  You
>> >> need to find a reasonable annotated representation.  So, if you took the
>> >> previous months searches where there is a chain of successive searches.  If
>> >> you weighted things differently would you shorten the length of the chain.
>> >> Can you get the click throughs to happen sooner.
>> >>
>> >> Anyway, just my 2 cents....
>> >>
>> >>
>> >> On Wed, Jan 31, 2018 at 6:38 PM, Sravan Kumar <sravan@caavo.com> wrote:
>> >>
>> >>>
>> >>> @Walter: We have 6 fields declared in schema.xml for title each with
>> >>> different type of analyzer. One without processing symbols, other stemmed
>> >>> and other removing  symbols, etc. So, if we have separate fields for each
>> >>> alias it will be that many times the number of final fields declared in
>> >>> schema.xml. And we exactly do not know what is the maximum number of
>> >>> aliases a movie can have.
>> >>> @Walter: I will try this but isn’t there any other way  where I can tweak ?
>> >>>
>> >>> @eric: will try this. But it will work only for exact matches.
>> >>>
>> >>>
>> >>>> On Jan 31, 2018, at 10:39 PM, Erick Erickson <erickerickson@gmail.com> wrote:
>> >>>>
>> >>>> Or use a boost for the phrase, something like
>> >>>> "beauty and the beast"^5
>> >>>>
>> >>>>> On Wed, Jan 31, 2018 at 8:43 AM, Walter Underwood <wunder@wunderwood.org> wrote:
>> >>>>> You can use a separate field for title aliases. That is what I did for
>> >>>>> Netflix search.
>> >>>>>
>> >>>>> Why disable idf? Disabling tf for titles can be a good idea, for
>> >>>>> example the movie “New York, New York” is not twice as much about New York
>> >>>>> as some other film that just lists it once.
>> >>>>>
>> >>>>> Also, consider using a popularity score as a boost.
>> >>>>>
>> >>>>> wunder
>> >>>>> Walter Underwood
>> >>>>> wunder@wunderwood.org
>> >>>>> http://observer.wunderwood.org/  (my blog)
>> >>>>>
>> >>>>>> On Jan 31, 2018, at 4:38 AM, Sravan Kumar <sravan@caavo.com> wrote:
>> >>>>>>
>> >>>>>> Hi,
>> >>>>>> We are using solr for our movie title search.
>> >>>>>>
>> >>>>>>
>> >>>>>> As it is "title search", this should be treated different than the normal
>> >>>>>> document search.
>> >>>>>> Hence, we use a modified version of TFIDFSimilarity with the following
>> >>>>>> changes.
>> >>>>>> -  disabled TF & IDF and will only have 1 as value.
>> >>>>>> -  disabled norms by specifying omitNorms as true for all the fields.
>> >>>>>>
>> >>>>>> There are 6 fields with different analyzers and we make use of different
>> >>>>>> weights in edismax's qf & pf parameters to match tokens & boost phrases.
>> >>>>>>
>> >>>>>> But, movies could have aliases and have multiple titles. So, we made the
>> >>>>>> fields multivalued.
>> >>>>>>
>> >>>>>> Now, consider the following four documents
>> >>>>>> 1>  "Beauty and the Beast"
>> >>>>>> 2>  "The Real Beauty and the Beast"
>> >>>>>> 3>  "Beauty and the Beast", "La bella y la bestia"
>> >>>>>> 4>  "Beauty and the Beast"
>> >>>>>>
>> >>>>>> Note: Document 3 has two titles in it.
>> >>>>>>
>> >>>>>> So, for a query "Beauty and the Beast" and with the above configuration all
>> >>>>>> the documents receive same score. But 1,3,4 should have got same score and
>> >>>>>> document 2 lesser than others.
>> >>>>>>
>> >>>>>> To solve this, we followed what is suggested in the following thread:
>> >>>>>> http://lucene.472066.n3.nabble.com/Influencing-scores-on-values-in-multiValue-fields-td1791651.html
>> >>>>>>
>> >>>>>> Now, the fields which are used to boost are made to use Norms. And for
>> >>>>>> matching norms are disabled. This is to make sure that exact & near exact
>> >>>>>> matches are rewarded.
>> >>>>>>
>> >>>>>> But, for the same query, we get the following results.
>> >>>>>> query: "Beauty & the Beast"
>> >>>>>> Search Results:
>> >>>>>> 1>  "Beauty and the Beast"
>> >>>>>> 4>  "Beauty and the Beast"
>> >>>>>> 2>  "The Real Beauty and the Beast"
>> >>>>>> 3>  "Beauty and the Beast", "La bella y la bestia"
>> >>>>>>
>> >>>>>> Clearly, the changes have solved only a part of the problem. The document 3
>> >>>>>> should be ranked/scored higher than document 2.
>> >>>>>>
>> >>>>>> This is because lucene considers the total field length across all the
>> >>>>>> values in a multivalued field for normalization.
>> >>>>>>
>> >>>>>> How do we handle this scenario and make sure that in multivalued fields the
>> >>>>>> normalization is taken care of?
>> >>>>>>
>> >>>>>>
>> >>>>>> --
>> >>>>>> Regards,
>> >>>>>> Sravan



-- 
Regards,
Sravan
