lucene-dev mailing list archives

From Karthick Duraisamy Soundararaj <d.s.karth...@gmail.com>
Subject Re: Diversifying Search Results - Custom Collector
Date Mon, 20 Aug 2012 20:52:45 GMT
Hi Mikhail,
You are correct. "[+] show 6 result.." would work, but it wouldn't suit
my requirements. This is a question of user experience, right?

Imagine the product manager comes to you and says: I don't want to see
"[+] show 6 result.."; I want the results to be diverse, but they should be
shown like any other search results.

I think grouping does this with a two-pass collection. In the first pass, it
figures out all the groups; in the second pass, it collects the results
into these groups.
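
A rough sketch of that two-pass idea, in plain Python rather than Lucene's actual grouping collectors: the first pass picks the top groups by their best score, and the second pass collects the best documents within those groups. The `(doc_id, group, score)` tuples and the name `two_pass_group` are assumptions made for illustration only, not Lucene API.

```python
from collections import defaultdict

def two_pass_group(hits, top_groups, docs_per_group):
    """hits: iterable of (doc_id, group, score) tuples (hypothetical shape)."""
    hits = list(hits)

    # Pass 1: record the best score seen for each group, keep the top groups.
    best = {}
    for _, group, score in hits:
        if group not in best or score > best[group]:
            best[group] = score
    selected = sorted(best, key=lambda g: -best[g])[:top_groups]

    # Pass 2: collect the highest-scoring documents inside each selected group.
    wanted = set(selected)
    buckets = defaultdict(list)
    for doc_id, group, score in hits:
        if group in wanted:
            buckets[group].append((score, doc_id))

    result = []
    for group in selected:
        ranked = sorted(buckets[group], key=lambda t: -t[0])[:docs_per_group]
        result.append((group, [doc_id for _, doc_id in ranked]))
    return result
```

The output is still organized per group, which is why it leads naturally to a "[+] show more" style of page rather than a flat, interleaved list.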


Thanks,
Karthick

On Mon, Aug 20, 2012 at 3:24 PM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

> Hello,
>
> I don't believe your task can be solved by playing with scoring/collectors
> or shuffling.
> To me it's clearly a Grouping use case (although I don't really know this
> feature well).
>
> > Grouping cannot solve the problem because I don't want to limit the
> > number of results shown based on the grouping field.
>
> I don't quite get it. Why can't you set the limit to 11 and just show
> labels like "[+] show 6 result..", or, if you have 11, "[+] show more than 10
> .."?
>
> If you have a problem constructing the search result page, I can
> suggest submitting a search request with rows=0&facet.field=BRAND; your
> algorithm can then choose the number of items needed for each brand and submit
> rows=X&fq=BRAND:Y, which gives you arbitrary sizes for the "groups".
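
A minimal sketch of the two-step request flow described above, building only the query strings (no HTTP call is made). The function names, the `max_per_brand` cap, and the example query are assumptions for illustration; `facet=true` is added because Solr needs it to enable faceting.

```python
from urllib.parse import urlencode

def facet_request(query, field="BRAND"):
    # Step 1: ask only for facet counts, no documents (rows=0).
    return urlencode({"q": query, "rows": 0, "facet": "true", "facet.field": field})

def per_brand_requests(query, facet_counts, max_per_brand, field="BRAND"):
    # Step 2: one filtered request per brand, sized here by a simple cap.
    # The cap is an assumed policy; any per-brand allocation rule would do.
    # Brand values containing spaces would need quoting in the fq value.
    requests = []
    for brand, count in facet_counts.items():
        rows = min(count, max_per_brand)
        if rows > 0:
            requests.append(urlencode({"q": query, "rows": rows, "fq": f"{field}:{brand}"}))
    return requests

print(facet_request("sofa"))
print(per_brand_requests("sofa", {"Nestle": 2, "Maggi": 7}, max_per_brand=3))
```

The cost of this approach is one extra request per brand, and the client has to merge the per-brand responses into a single page itself.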
>
> Will this work for you?
>
>
> On Mon, Aug 20, 2012 at 8:28 PM, Karthick Duraisamy Soundararaj <
> d.s.karthick@gmail.com> wrote:
>
>> Tanguy,
>> Your idea is perfect for cases where there are many documents and
>> 80-90% of them have the same value for a particular field.
>> For example, your idea is ideal when, let's say, we have 10 documents in
>> total like this,
>>
>>  doc1 : <merchantName> Kellog's </merchantName>
>>  doc2 : <merchantName> Kellog's </merchantName>
>>  doc3 : <merchantName> Kellog's </merchantName>
>>  doc4 : <merchantName> Kellog's </merchantName>
>>  doc5 : <merchantName> Kellog's </merchantName>
>>  doc6 : <merchantName> Kellog's </merchantName>
>>  doc7 : <merchantName> Kellog's </merchantName>
>>  doc8 : <merchantName> Nestle </merchantName>
>>  doc9 : <merchantName> Kellog's </merchantName>
>>  doc10 : <merchantName> Kellog's </merchantName>
>>
>> But I have
>>  doc1 : <merchantName> Maggi </merchantName>
>>  doc2 : <merchantName> Maggi  </merchantName>
>>  doc3 : <merchantName> M&M's </merchantName>
>>  doc4 : <merchantName> M&M's </merchantName>
>>  doc5 : <merchantName> Hershey's </merchantName>
>>  doc6 : <merchantName> Hershey's </merchantName>
>>  doc7 : <merchantName> Nestle </merchantName>
>>  doc8 : <merchantName> Nestle </merchantName>
>>  doc9 : <merchantName> Kellog's </merchantName>
>>  doc10 : <merchantName> Kellog's </merchantName>
>>
>>
>> Thanks,
>> Karthick
>>
>> On Mon, Aug 20, 2012 at 12:01 PM, Tanguy Moal <tanguy.moal@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I don't know if this helps, but if I understood your issue, you
>>> have a lot of documents with the same or very close scores. Moreover, I
>>> think you get your matches in Merchant order (more or less) because they
>>> must have been indexed in that very same order, so Solr returns documents
>>> with the same score in insertion order (although there is no contract
>>> specifying this).
>>>
>>> You could work around that issue by:
>>> 1/ Turning off tf/idf, because you're searching in documents with little
>>> text where only the match counts and frequencies obviously aren't helping.
>>> 2/ Adding a random number to each document at index time and boosting on
>>> that random value at query time. This will shuffle your results and is
>>> probably the simplest thing to do.
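
A small sketch of suggestion 2/, assuming an additive boost with a small weight; the field name `rand`, the weight, and the dictionary-shaped documents are illustrative assumptions, not Solr configuration.

```python
import random

def index_with_random(docs, seed=None):
    # Index time: attach a random value in [0, 1) to every document.
    rng = random.Random(seed)
    return [dict(doc, rand=rng.random()) for doc in docs]

def rank(docs, weight=0.01):
    # Query time: add a small random boost so near-equal scores get shuffled,
    # while clearly better-scoring documents keep their position.
    return sorted(docs, key=lambda d: d["score"] + weight * d["rand"], reverse=True)

docs = [{"id": i, "score": 1.0} for i in range(5)] + [{"id": 99, "score": 2.0}]
ranked = rank(index_with_random(docs, seed=1))
print([d["id"] for d in ranked])
```

This only randomizes the order among near-equal scores; by itself it does not guarantee that neighbouring results come from different merchants.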
>>>
>>> Hope this helps,
>>>
>>> Tanguy
>>>
>>> 2012/8/20 Karthick Duraisamy Soundararaj <d.s.karthick@gmail.com>
>>>
>>>> Hello Mikhail,
>>>> Thank you for the reply. In terms of user
>>>> experience, I want to spread out the products from the same brand farther
>>>> from each other, *at least* in the first 50-100 results we display. I am
>>>> thinking about two different approaches as solutions.
>>>>
>>>> 1. For the first few results, display one top-scoring
>>>> product per manufacturer (for a given field, display the top-scoring
>>>> results of the unique field values for the first N matches). This N could
>>>> be either a percentage relative to total matches or a configurable absolute
>>>> value.
>>>> 2. Enforce a penalty on the score for the
>>>> results that have duplicate field values. The penalty can be enforced in
>>>> such a way that results with higher scores are not affected, as opposed to
>>>> the ones with lower scores.
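
Approach 1 above could be sketched as a post-processing step over an already scored hit list. This is plain Python under assumed inputs (`(doc_id, brand, score)` tuples with made-up scores), not a Lucene collector.

```python
def diversify_first_n(hits, n):
    """hits: list of (doc_id, brand, score); returns reordered doc ids."""
    ranked = sorted(hits, key=lambda h: h[2], reverse=True)
    head, rest, seen = [], [], set()
    for hit in ranked:
        brand = hit[1]
        if len(head) < n and brand not in seen:
            # First (best-scoring) document of a brand not yet shown.
            seen.add(brand)
            head.append(hit)
        else:
            # Repeats of a brand, and everything once the head is full.
            rest.append(hit)
    # Unique-brand head first, then the remaining hits in score order.
    return [h[0] for h in head + rest]

hits = [
    ("doc1", "Maggi", 1.0), ("doc2", "Maggi", 0.9),
    ("doc3", "M&M's", 0.8), ("doc4", "M&M's", 0.7),
    ("doc5", "Hershey's", 0.6), ("doc6", "Hershey's", 0.5),
]
print(diversify_first_n(hits, 3))
```

If there are fewer distinct brands than N, the head simply ends up shorter and the remaining positions follow score order.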
>>>>
>>>> Both of the solutions can be implemented while sorting the documents
>>>> with TopFieldCollector / TopScoreDocCollector.
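
The penalty idea in approach 2 might look like the following when applied to a ranked list. The geometric `decay` factor is an assumption chosen for illustration; the sketch works on `(doc_id, brand, score)` tuples and does not use TopFieldCollector or TopScoreDocCollector.

```python
from collections import Counter

def penalize_duplicates(hits, decay=0.5):
    """hits: list of (doc_id, brand, score); returns (doc_id, adjusted_score) pairs."""
    ranked = sorted(hits, key=lambda h: h[2], reverse=True)
    seen = Counter()
    adjusted = []
    for doc_id, brand, score in ranked:
        # The k-th repeat of a brand is scaled by decay**k, so the
        # best-scoring document of each brand (k = 0) is left untouched.
        adjusted.append((doc_id, score * decay ** seen[brand]))
        seen[brand] += 1
    return sorted(adjusted, key=lambda p: p[1], reverse=True)

hits = [("a1", "A", 1.0), ("a2", "A", 0.9), ("b1", "B", 0.8), ("a3", "A", 0.7)]
print([doc_id for doc_id, _ in penalize_duplicates(hits)])
```

Because the top document of every brand keeps its score, the penalty only pushes down the lower-scoring repeats.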
>>>>
>>>> Does this answer your question?  Please let me know if you have any
>>>> more questions.
>>>>
>>>> Thanks,
>>>> Karthick
>>>>
>>>> On Mon, Aug 20, 2012 at 3:26 AM, Mikhail Khludnev <
>>>> mkhludnev@griddynamics.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I've got the problem description below. Can you explain the expected
>>>>> user experience, and/or solution approach before diving into the algorithm
>>>>> design?
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>> On Sat, Aug 18, 2012 at 2:50 AM, Karthick Duraisamy Soundararaj <
>>>>> karthick.soundararaj@gmail.com> wrote:
>>>>>
>>>>>> My problem is that when there are a lot of documents representing
>>>>>> products, products from the same manufacturer seem to appear in close
>>>>>> proximity in the results and therefore, it doesn't provide brand
>>>>>> diversity. When you search for sofas, you get sofas from manufacturer A
>>>>>> dominating the first page while the sofas from manufacturer B dominate
>>>>>> the second page, etc. The issue here is that a manufacturer tends to
>>>>>> describe the different sofas he produces the same way and therefore
>>>>>> there is very little difference between the documents representing
>>>>>> two sofas.
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Sincerely yours
>>>>> Mikhail Khludnev
>>>>> Tech Lead
>>>>> Grid Dynamics
>>>>>
>>>>> <http://www.griddynamics.com>
>>>>>  <mkhludnev@griddynamics.com>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Tech Lead
> Grid Dynamics
>
> <http://www.griddynamics.com>
>  <mkhludnev@griddynamics.com>
>
>
