lucene-dev mailing list archives

From Mikhail Khludnev <mkhlud...@griddynamics.com>
Subject Re: Diversifying Search Results - Custom Collector
Date Mon, 20 Aug 2012 19:24:50 GMT
Hello,

I don't believe your task can be solved by playing with scoring, a custom
collector, or shuffling.
To me it's clearly a Grouping use case (although I don't really know this
feature well).

> Grouping cannot solve the problem because I don't want to limit the number
> of results shown based on the grouping field.

I don't really get it. Why can't you set the limit to 11 and just show
labels like "[+] show 6 results.." or, if you get 11, "[+] show more than
10.."?
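
A rough sketch of the label logic, assuming the per-group limit is 11 and
that you already know how many documents came back for each brand. The
function name and its arguments are made up for illustration; they are not
part of any Solr or Lucene API.

```python
def group_label(returned, limit=11):
    """Build the expand label for one brand group.

    `returned` is the number of documents that came back for the group
    when at most `limit` were requested. Getting back `limit` documents
    only tells us there are at least that many, so the label claims
    "more than limit - 1" instead of an exact count.
    """
    if returned >= limit:
        return "[+] show more than %d.." % (limit - 1)
    plural = "" if returned == 1 else "s"
    return "[+] show %d result%s.." % (returned, plural)


if __name__ == "__main__":
    for n in (1, 6, 11):
        print(n, group_label(n))
```

Asking for one document more than you intend to display is what lets you
tell "exactly 10" apart from "more than 10" without a second request.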

If you have trouble constructing the search result page, I can suggest
submitting a search request with rows=0&facet.field=BRAND; your algorithm
can then choose the number of items it needs for every brand and submit
rows=X&fq=BRAND:Y for each one. That gives you arbitrary sizes for the
"groups".
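
Roughly, the two steps could look like the sketch below. It only builds the
query strings and does not talk to Solr. The helper names, the q parameter,
and the round-robin way of choosing X per brand are my assumptions; facet=true
is added because faceting has to be switched on for facet.field to take
effect. Brand values that themselves contain double quotes would need
escaping, which is not handled here.

```python
from urllib.parse import urlencode


def facet_params(query, field="BRAND"):
    # Step 1: fetch no documents, only the per-brand counts.
    return urlencode(
        {"q": query, "rows": 0, "facet": "true", "facet.field": field}
    )


def allocate(counts, page_size):
    """Hand out page slots to brands one at a time (round-robin).

    `counts` maps brand -> number of matching documents, as read from
    the facet response. No brand gets more slots than it has matches.
    """
    alloc = {brand: 0 for brand in counts}
    remaining = page_size
    while remaining > 0:
        progressed = False
        for brand, count in counts.items():
            if remaining == 0:
                break
            if alloc[brand] < count:
                alloc[brand] += 1
                remaining -= 1
                progressed = True
        if not progressed:  # every brand is exhausted
            break
    return alloc


def brand_params(query, alloc, field="BRAND"):
    # Step 2: one request per brand, rows=X&fq=BRAND:"Y".
    return [
        urlencode({"q": query, "rows": rows, "fq": '%s:"%s"' % (field, brand)})
        for brand, rows in alloc.items()
        if rows > 0
    ]


if __name__ == "__main__":
    counts = {"Maggi": 2, "Nestle": 1, "Kellog's": 5}
    print(facet_params("sofa"))
    for params in brand_params("sofa", allocate(counts, 6)):
        print(params)
```

Any other allocation rule (proportional to the counts, capped per brand,
and so on) can replace allocate() without touching the two request builders.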

Will this work for you?

On Mon, Aug 20, 2012 at 8:28 PM, Karthick Duraisamy Soundararaj <
d.s.karthick@gmail.com> wrote:

> Tanguy,
>               Your idea is perfect for cases where there are too many
> documents, with 80-90% of them having the same value for a particular field.
> As an example, your idea is ideal when, let's say, we have 10 documents in
> total like this,
>
>  doc1 : <merchantName> Kellog's </merchantName>
>  doc2 : <merchantName> Kellog's </merchantName>
>  doc3 : <merchantName> Kellog's </merchantName>
>  doc4 : <merchantName> Kellog's </merchantName>
>  doc5 : <merchantName> Kellog's </merchantName>
>  doc6 : <merchantName> Kellog's </merchantName>
>  doc7 : <merchantName> Kellog's </merchantName>
>  doc8 : <merchantName> Nestle </merchantName>
>  doc9 : <merchantName> Kellog's </merchantName>
>  doc10 : <merchantName> Kellog's </merchantName>
>
> But I have
>  doc1 : <merchantName> Maggi </merchantName>
>  doc2 : <merchantName> Maggi  </merchantName>
>  doc3 : <merchantName> M&M's </merchantName>
>  doc4 : <merchantName> M&M's </merchantName>
>  doc5 : <merchantName> Hershey's </merchantName>
>  doc6 : <merchantName> Hershey's </merchantName>
>  doc7 : <merchantName> Nestle </merchantName>
>  doc8 : <merchantName> Nestle </merchantName>
>  doc9 : <merchantName> Kellog's </merchantName>
>  doc10 : <merchantName> Kellog's </merchantName>
>
>
> Thanks,
> Karthick
>
> On Mon, Aug 20, 2012 at 12:01 PM, Tanguy Moal <tanguy.moal@gmail.com> wrote:
>
>> Hello,
>>
>> I don't know if that could help, but if I understood your issue, you have
>> a lot of documents with the same or very close scores. Moreover I think you
>> get your matches in Merchant order (more or less) because they must be
>> indexed in that very same order, so Solr returns documents with the same
>> score in insertion order (although there is no contract specifying this).
>>
>> You could work around that issue by:
>> 1/ Turning off tf/idf, because you're searching in documents with little
>> text where only the match counts and frequencies obviously aren't helping.
>> 2/ Adding a random number to each document at index time and boosting on
>> that random value at query time. This will shuffle your results, and it's
>> probably the simplest thing to do.
>>
>> Hope this helps,
>>
>> Tanguy
>>
>> 2012/8/20 Karthick Duraisamy Soundararaj <d.s.karthick@gmail.com>
>>
>>> Hello Mikhail,
>>>                         Thank you for the reply. In terms of user
>>> experience, I want to spread out the products from the same brand farther
>>> from each other, *at least* in the first 50-100 results we display. I am
>>> thinking about two different approaches as a solution.
>>>
>>>                       1. For first few results, display one top scoring
>>> product of a manufacturer  (For a given field, display the top scoring
>>> results of the unique field values for the first N matches) . This N could
>>> be either a percentage relative to total matches or a configurable absolute
>>> value.
>>>                       2. Enforce a penalty on the score for the results
>>> that have duplicate field values. The penalty can be enforced in such a
>>> way that the results with higher scores are not affected, as opposed to
>>> the ones with lower scores.
>>>
>>> Both of the solutions can be implemented while sorting the documents
>>> with TopFieldCollector / TopScoreDocCollector.
>>>
>>> Does this answer your question?  Please let me know if you have any more
>>> questions.
>>>
>>> Thanks,
>>> Karthick
>>>
>>> On Mon, Aug 20, 2012 at 3:26 AM, Mikhail Khludnev <
>>> mkhludnev@griddynamics.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I've got the problem description below. Can you explain the expected
>>>> user experience, and/or solution approach before diving into the algorithm
>>>> design?
>>>>
>>>> Thanks
>>>>
>>>>
>>>> On Sat, Aug 18, 2012 at 2:50 AM, Karthick Duraisamy Soundararaj <
>>>> karthick.soundararaj@gmail.com> wrote:
>>>>
>>>>> My problem is that when there are a lot of documents representing
>>>>> products, products from the same manufacturer seem to appear in close
>>>>> proximity in the results, and therefore the results don't provide
>>>>> brand diversity. When you search for sofas, you get sofas from
>>>>> manufacturer A dominating the first page while the sofas from
>>>>> manufacturer B dominate the second page, etc. The issue here is that
>>>>> a manufacturer tends to describe the different sofas he produces the
>>>>> same way, and therefore there is very little difference between the
>>>>> documents representing two sofas.
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Sincerely yours
>>>> Mikhail Khludnev
>>>> Tech Lead
>>>> Grid Dynamics
>>>>
>>>> <http://www.griddynamics.com>
>>>>  <mkhludnev@griddynamics.com>
>>>>
>>>>
>>>
>>>
>>
>
>


-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

<http://www.griddynamics.com>
 <mkhludnev@griddynamics.com>
