lucene-dev mailing list archives

From Karthick Duraisamy Soundararaj <d.s.karth...@gmail.com>
Subject Re: Diversifying Search Results - Custom Collector
Date Mon, 20 Aug 2012 16:28:41 GMT
Tanguy,
Your idea is perfect for cases where there are many documents and 80-90% of
them have the same value for a particular field. For example, your idea is
ideal if we have, let's say, 10 documents in total like this,

 doc1 : <merchantName> Kellog's </merchantName>
 doc2 : <merchantName> Kellog's </merchantName>
 doc3 : <merchantName> Kellog's </merchantName>
 doc4 : <merchantName> Kellog's </merchantName>
 doc5 : <merchantName> Kellog's </merchantName>
 doc6 : <merchantName> Kellog's </merchantName>
 doc7 : <merchantName> Kellog's </merchantName>
 doc8 : <merchantName> Nestle </merchantName>
 doc9 : <merchantName> Kellog's </merchantName>
 doc10 : <merchantName> Kellog's </merchantName>
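
Tanguy's second suggestion (quoted below), a random value stored with each
document, could be sketched on this list roughly as follows. This is plain
Python, not Solr; it assumes all ten documents share one score and uses the
random value as a tiebreak rather than a boost, which is a simplification.
The names docs and shuffle_ties are made up for the illustration.

```python
import random

# Hypothetical data: the ten documents above, all assumed to share one score.
docs = [(f"doc{i}", "Nestle" if i == 8 else "Kellog's", 1.0) for i in range(1, 11)]

def shuffle_ties(docs, seed):
    """Sort by score descending, breaking ties with a per-document random value."""
    rng = random.Random(seed)
    # Stand-in for a random number stored with each document at index time.
    rand_value = {doc_id: rng.random() for doc_id, _, _ in docs}
    return sorted(docs, key=lambda d: (-d[2], rand_value[d[0]]))

ranked = shuffle_ties(docs, seed=42)
print([doc_id for doc_id, _, _ in ranked])
```

With equal scores the order depends only on the random values, so the single
Nestle document can land anywhere in the list.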

But I have documents spread evenly across merchants, like this:
 doc1 : <merchantName> Maggi </merchantName>
 doc2 : <merchantName> Maggi  </merchantName>
 doc3 : <merchantName> M&M's </merchantName>
 doc4 : <merchantName> M&M's </merchantName>
 doc5 : <merchantName> Hershey's </merchantName>
 doc6 : <merchantName> Hershey's </merchantName>
 doc7 : <merchantName> Nestle </merchantName>
 doc8 : <merchantName> Nestle </merchantName>
 doc9 : <merchantName> Kellog's </merchantName>
 doc10 : <merchantName> Kellog's </merchantName>
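
For a list like this, the first approach from my earlier mail (quoted below),
showing the top result of each distinct field value before any repeats, could
be sketched as a round-robin over merchants. This is a minimal Python
illustration, not a Lucene collector; it assumes the input is already in score
order, and interleave_by_field is a made-up name.

```python
from collections import OrderedDict, deque

# The ten documents above, assumed to be already sorted by score.
docs = [
    ("doc1", "Maggi"), ("doc2", "Maggi"),
    ("doc3", "M&M's"), ("doc4", "M&M's"),
    ("doc5", "Hershey's"), ("doc6", "Hershey's"),
    ("doc7", "Nestle"), ("doc8", "Nestle"),
    ("doc9", "Kellog's"), ("doc10", "Kellog's"),
]

def interleave_by_field(ranked_docs):
    """Emit one document per merchant per pass, keeping score order within a merchant."""
    groups = OrderedDict()
    for doc_id, merchant in ranked_docs:
        groups.setdefault(merchant, deque()).append(doc_id)
    result = []
    while groups:
        # Copy the keys so exhausted merchants can be dropped during the pass.
        for merchant in list(groups):
            queue = groups[merchant]
            result.append(queue.popleft())
            if not queue:
                del groups[merchant]
    return result

print(interleave_by_field(docs))
```

On the list above this gives doc1, doc3, doc5, doc7, doc9, doc2, doc4, doc6,
doc8, doc10, so no two neighbours share a merchant. A real implementation
would probably apply this only to the first N matches.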


Thanks,
Karthick

On Mon, Aug 20, 2012 at 12:01 PM, Tanguy Moal <tanguy.moal@gmail.com> wrote:

> Hello,
>
> I don't know if that could help, but if I understood your issue, you have
> a lot of documents with the same or very close scores. Moreover I think you
> get your matches in Merchant order (more or less) because they must be
> indexed in that very same order, so Solr returns documents with the same
> score in insertion order (although there is no contract specifying this).
>
> You could work around that issue by:
> 1/ Turning off tf/idf, because you're searching in documents with little
> text where only the match counts and frequencies obviously aren't helping.
> 2/ Adding a random number to each document at index time and boosting on
> that random value at query time. This will shuffle your results, and is
> probably the simplest thing to do.
>
> Hope this helps,
>
> Tanguy
>
> 2012/8/20 Karthick Duraisamy Soundararaj <d.s.karthick@gmail.com>
>
>> Hello Mikhail,
>> Thank you for the reply. In terms of user experience, I want to spread
>> out the products from the same brand farther from each other, *at least*
>> in the first 50-100 results we display. I am thinking about two different
>> approaches as solutions.
>>
>> 1. For the first few results, display one top-scoring product per
>> manufacturer (for a given field, display the top-scoring results of the
>> unique field values for the first N matches). This N could be either a
>> percentage relative to total matches or a configurable absolute value.
>> 2. Enforce a penalty on the score of results that have duplicate field
>> values. The penalty can be enforced in such a way that results with
>> higher scores are affected less than the ones with lower scores.
>>
>> Both of the solutions can be implemented while sorting the documents with
>> TopFieldCollector / TopScoreDocCollector.
>>
>> Does this answer your question?  Please let me know if you have any more
>> questions.
>>
>> Thanks,
>> Karthick
>>
>> On Mon, Aug 20, 2012 at 3:26 AM, Mikhail Khludnev <
>> mkhludnev@griddynamics.com> wrote:
>>
>>> Hello,
>>>
>>> I've got the problem description below. Can you explain the expected
>>> user experience, and/or solution approach before diving into the algorithm
>>> design?
>>>
>>> Thanks
>>>
>>>
>>> On Sat, Aug 18, 2012 at 2:50 AM, Karthick Duraisamy Soundararaj <
>>> karthick.soundararaj@gmail.com> wrote:
>>>
>>>> My problem is that when there are a lot of documents representing
>>>> products, products from the same manufacturer seem to appear in close
>>>> proximity in the results, and therefore the results don't provide brand
>>>> diversity. When you search for sofas, you get sofas from manufacturer A
>>>> dominating the first page while sofas from manufacturer B dominate the
>>>> second page, etc. The issue here is that a manufacturer tends to describe
>>>> the different sofas he produces the same way, and therefore there is very
>>>> little difference between the documents representing two sofas.
>>>>
>>>
>>>
>>>
>>> --
>>> Sincerely yours
>>> Mikhail Khludnev
>>> Tech Lead
>>> Grid Dynamics
>>>
>>> <http://www.griddynamics.com>
>>>  <mkhludnev@griddynamics.com>
>>>
>>>
>>
>>
>
