lucene-dev mailing list archives

From Karthick Duraisamy Soundararaj <d.s.karth...@gmail.com>
Subject Re: Diversifying Search Results - Custom Collector
Date Tue, 21 Aug 2012 13:31:44 GMT
Hi Lance,
Thanks for your response. Wouldn't randomizing affect
relevancy? Maybe I should explain my problem better:

Let's say there are 1000 matches for a search of "Sofas". For the sake of
simplicity, let's assume all of these 1000 matches (1000 sofas) have the
same merchant. Then the solution that you and Tanguy suggest, randomizing
the result order, would be perfect. However, my case is different. Out of
these 1000 matches, there are about 100 unique manufacturers, and each of
them makes 10 sofas. So whenever one sofa from a particular manufacturer is
displayed, the other sofas from that manufacturer appear close together as
well. Please note that the problem is not about relevancy, as the sofas are
all very relevant; they appear close together in the result set only
because they are described in more or less the same way, with the same
words.

That's why I want a sorting policy along the lines of *"Find the
highest-scoring document for each manufacturer in the current result set
and place them ahead of the rest"*. As you can see, the idea is to display
one product from each unique manufacturer first. The number of unique
manufacturers to show before the normal ordering resumes can be determined
relative to the total number of unique manufacturers. For example, if there
are 90 unique manufacturers, display products from 45 (approx. 50%) first
before displaying the rest of the products.
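
To make the policy concrete, here is a minimal sketch in Python. It is
only an illustration that reorders an already collected, fully scored hit
list; it is not a Lucene Collector, and the `diversify` function, the
`fraction` parameter, and the (doc_id, manufacturer, score) tuple layout
are all hypothetical names chosen for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical hit layout for illustration: (doc_id, manufacturer, score)
Hit = Tuple[str, str, float]


def diversify(hits: List[Hit], fraction: float = 0.5) -> List[Hit]:
    """Move the best hit of the top manufacturers ahead of the rest.

    `fraction` is the share of unique manufacturers whose best hit is
    promoted (rounded down); all other hits keep their score order.
    """
    # Normal ordering: highest score first (sorted() is stable for ties).
    ranked = sorted(hits, key=lambda h: h[2], reverse=True)

    # First hit seen per manufacturer is its highest-scoring one.
    # dict preserves insertion order, so manufacturers stay in rank order.
    best: Dict[str, Hit] = {}
    for hit in ranked:
        best.setdefault(hit[1], hit)

    # Promote the best hit of the first N manufacturers only.
    n_promote = int(len(best) * fraction)
    promoted = list(best.values())[:n_promote]
    promoted_ids = {h[0] for h in promoted}

    # Everything else follows in the normal score order.
    rest = [h for h in ranked if h[0] not in promoted_ids]
    return promoted + rest
```

With 90 unique manufacturers and fraction=0.5, the first 45 results come
from 45 different manufacturers, and the remaining results follow in plain
score order. Since the whole result set is needed before reordering, this
sketch says nothing about how to do it efficiently inside a collector.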

Does this make sense?

Thanks,
Karthick



On Mon, Aug 20, 2012 at 5:05 PM, Lance Norskog <goksron@gmail.com> wrote:

> If you do the same search twice in a row, the second search takes < 3
> ms. Try finding your base result set and then augmenting it with a
> second search within the first result set.


> You can sort from a function call. Sorting is multi-level, so you can
> make one of the levels random.
>
> Does this app have to support paging the search list? If so, do you
> plan to do a second search for the next 5 results?  Complex results
> shuffling can make this hard. Also, I don't know exactly how random
> works, whether it generates the same random order twice. This would
> make paging impossible.
>
> On Mon, Aug 20, 2012 at 1:52 PM, Karthick Duraisamy Soundararaj
> <d.s.karthick@gmail.com> wrote:
> > Hi Mikhail,
> > You are correct. "[+] show 6 result.." will work, but it wouldn't suit
> > my requirements. This is a question of user experience, right?
> >
> > Imagine the product manager comes to you and says: I don't want to see
> > "[+] show 6 result..", I want the results to be diverse but shown like
> > any other search results.
> >
> > I think grouping does this with a two-pass collection: in the first
> > pass it figures out all the groups, and in the second pass it collects
> > the results into these groups.
> >
> >
> > Thanks,
> > Karthick
> >
> > On Mon, Aug 20, 2012 at 3:24 PM, Mikhail Khludnev
> > <mkhludnev@griddynamics.com> wrote:
> >>
> >> Hello,
> >>
> >> I don't believe your task can be solved by playing with
> >> scoring/collector or shuffling.
> >> For me it's absolutely a Grouping use case (although I don't really
> >> know this feature well).
> >>
> >> > Grouping cannot solve the problem because I dont want to limit the
> >> > number of results showed based on the grouping field.
> >>
> >> I'm not really getting it. Why can't you set the limit to 11 and just
> >> show labels like "[+] show 6 result..", or, if you have 11,
> >> "[+] show more than 10 .."?
> >>
> >> If you experience problems constructing the search result page, I can
> >> suggest submitting a search request with rows=0&facet.field=BRAND; then
> >> your algorithm can choose the number of items needed for every brand
> >> and submit rows=X&fq=BRAND:Y, which gives you arbitrary sizes for
> >> "groups".
> >>
> >> Will this work for you?
> >>
> >>
> >> On Mon, Aug 20, 2012 at 8:28 PM, Karthick Duraisamy Soundararaj
> >> <d.s.karthick@gmail.com> wrote:
> >>>
> >>> Tanguy,
> >>> Your idea is perfect for cases where there are many documents, with
> >>> 80-90% of them having the same value for a particular field. As an
> >>> example, your idea is ideal when, let's say, we have 10 documents in
> >>> total like this,
> >>>
> >>>  doc1 : <merchantName> Kellog's </merchantName>
> >>>  doc2 : <merchantName> Kellog's </merchantName>
> >>>  doc3 : <merchantName> Kellog's </merchantName>
> >>>  doc4 : <merchantName> Kellog's </merchantName>
> >>>  doc5 : <merchantName> Kellog's </merchantName>
> >>>  doc6 : <merchantName> Kellog's </merchantName>
> >>>  doc7 : <merchantName> Kellog's </merchantName>
> >>>  doc8 : <merchantName> Nestle </merchantName>
> >>>  doc9 : <merchantName> Kellog's </merchantName>
> >>>  doc10 : <merchantName> Kellog's </merchantName>
> >>>
> >>> But I have
> >>>  doc1 : <merchantName> Maggi </merchantName>
> >>>  doc2 : <merchantName> Maggi  </merchantName>
> >>>  doc3 : <merchantName> M&M's </merchantName>
> >>>  doc4 : <merchantName> M&M's </merchantName>
> >>>  doc5 : <merchantName> Hershey's </merchantName>
> >>>  doc6 : <merchantName> Hershey's </merchantName>
> >>>  doc7 : <merchantName> Nestle </merchantName>
> >>>  doc8 : <merchantName> Nestle </merchantName>
> >>>  doc9 : <merchantName> Kellog's </merchantName>
> >>>  doc10 : <merchantName> Kellog's </merchantName>
> >>>
> >>>
> >>> Thanks,
> >>> Karthick
> >>>
> >>> On Mon, Aug 20, 2012 at 12:01 PM, Tanguy Moal <tanguy.moal@gmail.com>
> >>> wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> I don't know if that could help, but if I understood your issue, you
> >>>> have a lot of documents with the same or very close scores. Moreover,
> >>>> I think you get your matches in Merchant order (more or less) because
> >>>> they must be indexed in that very same order, so Solr returns
> >>>> documents with the same score in insertion order (although there is
> >>>> no contract specifying this).
> >>>>
> >>>> You could work around that issue by:
> >>>> 1/ Turning off tf/idf, because you're searching in documents with
> >>>> little text where only the match counts, and frequencies obviously
> >>>> aren't helping.
> >>>> 2/ Adding a random number to each document at index time and boosting
> >>>> on that random value at query time. This will shuffle your results;
> >>>> it's probably the simplest thing to do.
> >>>>
> >>>> Hope this helps,
> >>>>
> >>>> Tanguy
> >>>>
> >>>> 2012/8/20 Karthick Duraisamy Soundararaj <d.s.karthick@gmail.com>
> >>>>>
> >>>>> Hello Mikhail,
> >>>>> Thank you for the reply. In terms of user experience, I want to
> >>>>> spread out the products from the same brand farther from each other,
> >>>>> at least in the first 50-100 results we display. I am thinking about
> >>>>> two different approaches as a solution.
> >>>>>
> >>>>> 1. For the first few results, display one top-scoring product of
> >>>>> each manufacturer (for a given field, display the top-scoring
> >>>>> results of the unique field values for the first N matches). This N
> >>>>> could be either a percentage relative to total matches or a
> >>>>> configurable absolute value.
> >>>>> 2. Enforce a penalty on the score of results that have duplicate
> >>>>> field values. The penalty can be enforced in such a way that results
> >>>>> with higher scores will not be affected, as opposed to the ones with
> >>>>> lower scores.
> >>>>>
> >>>>> Both of the solutions can be implemented while sorting the documents
> >>>>> with TopFieldCollector / TopScoreDocCollector.
> >>>>>
> >>>>> Does this answer your question? Please let me know if you have any
> >>>>> more questions.
> >>>>>
> >>>>> Thanks,
> >>>>> Karthick
> >>>>>
> >>>>> On Mon, Aug 20, 2012 at 3:26 AM, Mikhail Khludnev
> >>>>> <mkhludnev@griddynamics.com> wrote:
> >>>>>>
> >>>>>> Hello,
> >>>>>>
> >>>>>> I've got the problem description below. Can you explain the expected
> >>>>>> user experience and/or solution approach before diving into the
> >>>>>> algorithm design?
> >>>>>>
> >>>>>> Thanks
> >>>>>>
> >>>>>>
> >>>>>> On Sat, Aug 18, 2012 at 2:50 AM, Karthick Duraisamy Soundararaj
> >>>>>> <karthick.soundararaj@gmail.com> wrote:
> >>>>>>>
> >>>>>>> My problem is that when there are a lot of documents representing
> >>>>>>> products, products from the same manufacturer seem to appear in
> >>>>>>> close proximity in the results, and therefore the results don't
> >>>>>>> provide brand diversity. When you search for sofas, you get sofas
> >>>>>>> from manufacturer A dominating the first page, sofas from
> >>>>>>> manufacturer B dominating the second page, etc. The issue here is
> >>>>>>> that a manufacturer tends to describe the different sofas it
> >>>>>>> produces the same way, and therefore there is very little
> >>>>>>> difference between the documents representing two sofas.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Sincerely yours
> >>>>>> Mikhail Khludnev
> >>>>>> Tech Lead
> >>>>>> Grid Dynamics
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >>
> >> --
> >> Sincerely yours
> >> Mikhail Khludnev
> >> Tech Lead
> >> Grid Dynamics
> >>
> >>
> >>
> >
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>