lucene-dev mailing list archives

From Karthick Duraisamy Soundararaj
Subject Diversifying Search Results - Custom Collector
Date Fri, 17 Aug 2012 22:50:14 GMT
Hi all,
I know this is a rather long description, so thanks in advance for

I am trying to implement a custom collector whose job is to diversify the
results based on a field. Grouping cannot solve the problem because I don't
want to limit the number of results shown based on the grouping field. The
requirement is similar to the discussion here, but the problem I am trying
to solve is not the same. I don't need negative boosts, nor do I have too
many documents by the same author.

My problem is that when there are a lot of documents representing products,
products from the same manufacturer tend to appear close together in the
results, so the results lack brand diversity. When you search for sofas,
sofas from manufacturer A dominate the first page, sofas from manufacturer
B dominate the second page, and so on. The cause is that a manufacturer
tends to describe all of its sofas the same way, so there is very little
difference between the documents representing two of its sofas, and they
receive nearly identical scores.

I don't want to use grouping because I don't want to limit the number of
products from a manufacturer.

I am thinking about implementing a custom collector that diversifies the
results by applying a penalty to the scores of documents before adding them
to the priority queue (FieldValueHitQueue). I am not exactly sure how I am
going to enforce this penalty, but I seem to need three things:
   1. A way to manipulate the score at collection time (TopFieldCollector)
      after the default scorer has done its job. The comparator seems to
      use getScore in its copy and setBottom methods.
   2. A reference to every entry in the priority queue
      ( HashMap<diversifyingField, Entry> ).
   3. A way to reorder the heap after altering a score.
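
A minimal sketch of the penalty idea above, written in plain Java rather
than against the Lucene collector API. Hit, DiversifyingTopN, collect and
topHits are hypothetical names, and the rule used here (multiply the raw
score by penalty once for every hit of the same brand already queued) is
only an assumption about how the penalty might be applied.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical stand-in for a collected hit; not a Lucene class.
final class Hit {
    final int doc;
    final String brand;  // value of the diversifying field
    final float score;   // score after the penalty has been applied

    Hit(int doc, String brand, float score) {
        this.doc = doc;
        this.brand = brand;
        this.score = score;
    }
}

// Hypothetical collector-like class: keeps the top N hits by penalized score.
final class DiversifyingTopN {
    private final int size;
    private final float penalty; // assumed multiplicative, e.g. 0.5f
    // Min-heap: the weakest queued hit is always at the head.
    private final PriorityQueue<Hit> queue =
            new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
    // Number of currently queued hits per brand.
    private final Map<String, Integer> brandCounts = new HashMap<>();

    DiversifyingTopN(int size, float penalty) {
        this.size = size;
        this.penalty = penalty;
    }

    void collect(int doc, String brand, float rawScore) {
        int seen = brandCounts.getOrDefault(brand, 0);
        // Penalize once per same-brand hit that is already in the queue.
        float penalized = rawScore * (float) Math.pow(penalty, seen);
        if (queue.size() < size) {
            insert(new Hit(doc, brand, penalized));
        } else if (penalized > queue.peek().score) {
            Hit evicted = queue.poll();
            brandCounts.merge(evicted.brand, -1, Integer::sum);
            insert(new Hit(doc, brand, penalized));
        }
    }

    private void insert(Hit hit) {
        queue.add(hit);
        brandCounts.merge(hit.brand, 1, Integer::sum);
    }

    // Returns the queued hits, best penalized score first.
    List<Hit> topHits() {
        List<Hit> hits = new ArrayList<>(queue);
        hits.sort((a, b) -> Float.compare(b.score, a.score));
        return hits;
    }
}
```

With a queue of size 3 and a penalty of 0.5, collecting brand A hits with
raw scores 1.0, 0.9 and 0.8 followed by a brand B hit with 0.7 leaves the
B hit ranked second, ahead of the second A hit. The sketch never revisits
the score of an entry that is already queued, so the result depends on
collection order; that is the gap the heap reordering in point 3 is meant
to close.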

I would appreciate any suggestions on the best way to do these:
   1. What's the best way to manipulate the score from within the
      collector so that the change is reflected in the comparator?
   2. Do you think it is going to degrade the performance terribly if I
      hold on to references to all the entries of the
   3. Is it a terrible idea to reorder the heap?
      Reordering would introduce two operations for duplicate field
      values. Based on the lookup from the priorityQueue, change the score
      of an entry. Since this change could be anywhere in the heap, the
      following needs to be done to ensure heap ordering:

          float tmpScore =
          entryToBeModified.score = MaxScore + (reversMul * entryToBeModified.score)
          pq.Top().score = tmp.score * penalty  /* this is the new penalty to update the score */
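
A minimal sketch of one way to keep a heap valid when the score of an
arbitrary entry changes, again in plain Java with java.util.PriorityQueue
instead of Lucene's own queue. ScoredEntry, ReorderingQueue and penalize
are hypothetical names, and the map holds only the most recently added
entry per brand to keep the example short. It swaps in a simpler technique
than the pseudocode above: the entry is removed, rescored and re-added.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical mutable queue entry; not Lucene's Entry class.
final class ScoredEntry {
    final int doc;
    final String brand;
    float score;

    ScoredEntry(int doc, String brand, float score) {
        this.doc = doc;
        this.brand = brand;
        this.score = score;
    }
}

final class ReorderingQueue {
    // Min-heap on score: the weakest entry is at the head.
    private final PriorityQueue<ScoredEntry> heap =
            new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
    // Lookup from diversifying field value to the latest entry added for it.
    private final Map<String, ScoredEntry> lastByBrand = new HashMap<>();

    void add(ScoredEntry entry) {
        heap.add(entry);
        lastByBrand.put(entry.brand, entry);
    }

    // Multiplies the score of the latest entry for the brand by penalty.
    // Returns false if the brand has no queued entry.
    boolean penalize(String brand, float penalty) {
        ScoredEntry entry = lastByBrand.get(brand);
        // Remove before mutating so the heap is never left out of order.
        if (entry == null || !heap.remove(entry)) {
            return false;
        }
        entry.score *= penalty;
        heap.add(entry); // re-insertion restores heap ordering
        return true;
    }

    ScoredEntry bottom() {
        return heap.peek();
    }
}
```

PriorityQueue.remove(Object) scans the queue linearly, so each penalty
costs O(n) in the queue size. That is probably acceptable for a page-sized
queue, but it is the cost to measure before settling on this approach.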

I looked at function queries, but a function query doesn't seem to have any
knowledge of ordering, so I don't see a way to create a custom function
query to achieve this.

I would like to make it as modular as possible, and I promise to contribute
it back :) ! I want it to be extensible, clean and pluggable. Please do let
me know if you need any more information or if you feel there is a better
way to achieve the functionality.

