lucene-java-user mailing list archives

From "Peter Keegan" <peterlkee...@gmail.com>
Subject Re: FieldSortedHitQueue enhancement
Date Fri, 30 Mar 2007 00:09:56 GMT
> Peter, how did you achieve 'last wins' as you must presumably remove first
> from the PQ?

I implemented 'first wins' because the score is less important than other
fields (distance, in our case), but you make a good point since score may be
more important. How did you implement remove()?

Peter


On 3/29/07, Antony Bowesman <adb@teamware.com> wrote:
>
> I've got a similar duplicate case, but my duplicates are based on an
> external ID rather than Doc id, so they occur for a single Query. It's
> using a custom HitCollector, but score based, not field sorted.
>
> If my duplicate has a higher score than the one on the PQ, I need to
> update the stored score with the higher one, so PQ needs a replace() method
> where the stored object.equals() can be used to find the object to delete.
> I'm not sure if there's a way to find the object efficiently in this case
> other than a linear search. I implemented remove().
>
> Peter, how did you achieve 'last wins' as you must presumably remove first
> from the PQ?
>
> Antony
>
>
> Peter Keegan wrote:
> > The duplicate check would just be on the doc ID. I'm using TreeSet to
> > detect duplicates with no noticeable effect on performance. The PQ only
> > has to be checked for a previous value IFF the element about to be
> > inserted is actually inserted and not dropped because it's less than the
> > least value already in there. So, the TreeSet is never bigger than the
> > size of the PQ (typically 25 to a few hundred items), not the size of all
> > hits.
> >
> > Peter
> >
> > On 3/29/07, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:
> >>
> >> Hm, removing duplicates (as determined by a value of a specified document
> >> field) from the results would be nice.
> >> How would your addition affect performance, considering it has to check
> >> the PQ for a previous value for every candidate hit?
> >>
> >> Otis
> >> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> >> Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
> >>
> >> ----- Original Message ----
> >> From: Peter Keegan <peterlkeegan@gmail.com>
> >> To: java-user@lucene.apache.org
> >> Sent: Thursday, March 29, 2007 9:39:13 AM
> >> Subject: FieldSortedHitQueue enhancement
> >>
> >> This is a request for an enhancement to FieldSortedHitQueue/PriorityQueue
> >> that would prevent duplicate documents from being inserted, or
> >> alternatively, allow the application to prevent this (reason explained
> >> below). I can do this today by making the 'lessThan' method public and
> >> checking the queue before inserting like this:
> >>
> >> if (hq.size() < maxSize) {
> >>   // doc will be inserted into queue - check for duplicate before inserting
> >> } else if (hq.size() > 0
> >>     && !hq.lessThan((ScoreDoc) fieldDoc, (ScoreDoc) hq.top())) {
> >>   // doc will be inserted into queue - check for duplicate before inserting
> >> } else {
> >>   // doc will not be inserted - no check needed
> >> }
> >>
> >> However, this is just replicating existing code in
> >> PriorityQueue->insert(). An alternative would be to have a method like:
> >>
> >> public boolean wouldBeInserted(ScoreDoc doc)
> >> // returns true if doc would be inserted, without inserting
> >>
> >> The reason for this is that I have some queries that get expanded into
> >> multiple searches and the resulting hits are OR'd together. The queries
> >> contain 'terms' that are not seen by Lucene but are handled by a
> >> HitCollector that uses external data for each document to evaluate hits.
> >> The results from the priority queue should contain no duplicate documents
> >> (first or last doc wins).
> >>
> >> Do any of these suggestions seem reasonable? So far, I've been able to
> >> use Lucene without any modifications, and hope to continue this way.
> >>
> >> Peter
> >>
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >
>
>
>
>
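For readers following the thread, here is a rough sketch of the 'first wins' idea discussed above: a wouldBeInserted() check on a bounded queue, plus a set of doc IDs that is consulted only for hits that are competitive. It is built on java.util.PriorityQueue and HashSet rather than Lucene's own classes, and every name in it (DedupTopN, Hit, insert) is hypothetical. It is not code from Lucene or from anyone on this list, and it orders by score only, whereas the thread is about field-sorted queues.

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

/** Hypothetical bounded top-N queue that rejects duplicate doc IDs ("first wins"). */
final class DedupTopN {

    /** Hypothetical stand-in for a scored hit; this is not Lucene's ScoreDoc. */
    static final class Hit {
        public final int doc;
        public final float score;

        public Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    private final int maxSize;

    // Min-heap on score: the head is the least competitive hit currently kept.
    private final PriorityQueue<Hit> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));

    // Doc IDs currently held in the heap, so it never grows beyond maxSize.
    private final Set<Integer> docsInHeap = new HashSet<>();

    public DedupTopN(int maxSize) {
        this.maxSize = maxSize;
    }

    /** Returns true if the hit would enter the queue, without inserting it. */
    public boolean wouldBeInserted(Hit hit) {
        if (heap.size() < maxSize) {
            return true;
        }
        // Mirrors the "!lessThan(element, top())" test quoted in the thread.
        return !heap.isEmpty() && !(hit.score < heap.peek().score);
    }

    /** Inserts the hit unless it is uncompetitive or its doc ID is already queued. */
    public boolean insert(Hit hit) {
        if (!wouldBeInserted(hit)) {
            return false; // dropped before any duplicate lookup is needed
        }
        if (docsInHeap.contains(hit.doc)) {
            return false; // first wins: the earlier hit for this doc stays
        }
        if (heap.size() == maxSize) {
            // Evict the least competitive hit and forget its doc ID.
            docsInHeap.remove(heap.poll().doc);
        }
        heap.add(hit);
        docsInHeap.add(hit.doc);
        return true;
    }

    public int size() {
        return heap.size();
    }
}
```

One consequence of tracking only the IDs currently in the queue: a document that has been evicted can be inserted again later if a more competitive hit for it arrives.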

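The replace() case raised above (duplicates keyed by an external ID, keeping the higher score) can be sketched in a similar way. This version assumes a HashMap from external ID to the queued entry, so the duplicate is found without scanning; the removal itself still goes through java.util.PriorityQueue.remove(Object), which is a linear search, as suspected in the thread. BestScoreTopN, Entry, offer and scoreOf are hypothetical names, and this is only one possible approach, not the implementation anyone here described.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical bounded top-N queue keyed by external ID; the best score per ID wins. */
final class BestScoreTopN {

    /** Hypothetical queued entry; equals() is identity, which is all remove() needs here. */
    static final class Entry {
        public final String externalId;
        public final float score;

        public Entry(String externalId, float score) {
            this.externalId = externalId;
            this.score = score;
        }
    }

    private final int maxSize;

    // Min-heap on score: the head is the lowest-scoring entry currently kept.
    private final PriorityQueue<Entry> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Entry e) -> e.score));

    // External ID -> entry currently in the heap.
    private final Map<String, Entry> byId = new HashMap<>();

    public BestScoreTopN(int maxSize) {
        this.maxSize = maxSize;
    }

    /** Adds the hit, or replaces a queued entry with the same ID if the new score is higher. */
    public boolean offer(String externalId, float score) {
        Entry existing = byId.get(externalId);
        if (existing != null) {
            if (score <= existing.score) {
                return false; // the stored score is already at least as good
            }
            heap.remove(existing); // linear scan inside java.util.PriorityQueue
        } else if (heap.size() == maxSize) {
            if (maxSize == 0 || score < heap.peek().score) {
                return false; // not competitive with the current lowest entry
            }
            byId.remove(heap.poll().externalId); // evict the lowest entry
        }
        Entry entry = new Entry(externalId, score);
        heap.add(entry);
        byId.put(externalId, entry);
        return true;
    }

    /** Returns the stored score for the ID, or null if the ID is not in the queue. */
    public Float scoreOf(String externalId) {
        Entry e = byId.get(externalId);
        return e == null ? null : e.score;
    }

    public int size() {
        return heap.size();
    }
}
```

With queue sizes in the range mentioned in the thread (25 to a few hundred items), the linear remove is unlikely to matter much, but that is an assumption worth measuring.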