lucene-java-user mailing list archives

From Antony Bowesman <...@teamware.com>
Subject Re: FieldSortedHitQueue enhancement
Date Fri, 30 Mar 2007 01:31:08 GMT
Peter Keegan wrote:
> I implemented 'first wins' because the score is less important than other
> fields (distance, in our case), but you make a good point since score 
> may be
> more important. How did you implement remove()?

I've got my own PriorityQueue:

     public boolean remove(E o)
     {
         if (o == null)
             return false;

         // Heap slots 1..size are used; scan linearly, comparing by
         // identity since we hold a reference to the original object.
         for (int i = 1; i <= size; i++)
         {
             if (queue[i] == o)
             {
                 removeElement(i);   // delete slot i and re-heapify
                 return true;
             }
         }
         return false;
     }

I've got a reference to the original object, so I'm using == to locate it.
I've not used equals(), as I've not yet worked out whether that would cause
me any problems with hashing.
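A replace() can then be built on remove() plus insert(). As a self-contained
sketch (a minimal 1-based binary min-heap with ints standing in for scored
hits, not the actual Lucene PriorityQueue, and with made-up method names):

```java
// Minimal sketch, not Lucene's PriorityQueue: a 1-based binary min-heap
// showing how a linear-scan remove() can support a replace() that swaps
// a stale entry for an updated one.
class MiniHeap {
    private final int[] heap;   // slots 1..size used, slot 0 unused
    private int size;

    MiniHeap(int maxSize) { heap = new int[maxSize + 1]; }

    void insert(int v) {
        heap[++size] = v;
        siftUp(size);
    }

    // O(n) scan to locate the entry, O(log n) to restore heap order.
    boolean remove(int v) {
        for (int i = 1; i <= size; i++) {
            if (heap[i] == v) {
                removeAt(i);
                return true;
            }
        }
        return false;
    }

    // replace(): delete the stale entry, then insert the updated one.
    boolean replace(int oldV, int newV) {
        if (!remove(oldV)) return false;
        insert(newV);
        return true;
    }

    int top() { return heap[1]; }
    int size() { return size; }

    private void removeAt(int i) {
        heap[i] = heap[size--];       // move last element into the hole
        if (i <= size) {              // unless we removed the last slot
            siftDown(i);
            siftUp(i);                // at most one of these moves it
        }
    }

    private void siftUp(int i) {
        for (; i > 1 && heap[i] < heap[i / 2]; i /= 2) swap(i, i / 2);
    }

    private void siftDown(int i) {
        while (2 * i <= size) {
            int c = 2 * i;
            if (c < size && heap[c + 1] < heap[c]) c++;
            if (heap[i] <= heap[c]) break;
            swap(i, c);
            i = c;
        }
    }

    private void swap(int a, int b) { int t = heap[a]; heap[a] = heap[b]; heap[b] = t; }
}
```

The linear scan dominates, so replace() stays O(n) either way; the heap
repair itself is logarithmic.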

Antony

> 
> Peter
> 
> 
> On 3/29/07, Antony Bowesman <adb@teamware.com> wrote:
>>
>> I've got a similar duplicate case, but my duplicates are based on an
>> external ID rather than Doc id, so it occurs within a single Query.
>> It's using a custom HitCollector, but score based, not field sorted.
>>
>> If my duplicate has a higher score than the one on the PQ, I need to
>> update the stored score with the higher one, so PQ needs a replace()
>> method where the stored object.equals() can be used to find the object
>> to delete.  I'm not sure if there's a way to find the object
>> efficiently in this case other than a linear search.  I implemented
>> remove().
>>
>> Peter, how did you achieve 'last wins', as you must presumably remove
>> the existing entry from the PQ first?
>>
>> Antony
>>
>>
>> Peter Keegan wrote:
>> > The duplicate check would just be on the doc ID. I'm using a TreeSet
>> > to detect duplicates, with no noticeable effect on performance. The
>> > PQ only has to be checked for a previous value IFF the element about
>> > to be inserted is actually inserted, and not dropped because it's
>> > less than the least value already in there. So, the TreeSet is never
>> > bigger than the size of the PQ (typically 25 to a few hundred items),
>> > not the size of all hits.
>> >
>> > Peter
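The bookkeeping Peter describes could be sketched roughly as follows, using
java.util collections and hypothetical names (Hit, TopDocsCollector) rather
than Lucene's PriorityQueue/HitCollector. The TreeSet of doc IDs mirrors
exactly the docs currently in the queue, so it never grows beyond maxSize,
and it is only consulted once the hit is known to be good enough to insert:

```java
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.TreeSet;

// Sketch only: a bounded min-queue of (doc, score) plus a TreeSet of the
// doc IDs currently held in the queue, giving an O(log maxSize) duplicate
// check with 'first wins' semantics.
class TopDocsCollector {
    static final class Hit {
        final int doc; final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final int maxSize;
    private final PriorityQueue<Hit> pq =
        new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score)); // least score on top
    private final TreeSet<Integer> docsInQueue = new TreeSet<>();

    TopDocsCollector(int maxSize) { this.maxSize = maxSize; }

    void collect(int doc, float score) {
        // Only consult the duplicate set if the hit would actually be kept.
        if (pq.size() >= maxSize && score <= pq.peek().score) return;
        if (!docsInQueue.add(doc)) return;  // 'first wins': drop later duplicates
        pq.add(new Hit(doc, score));
        if (pq.size() > maxSize) {
            Hit evicted = pq.poll();
            docsInQueue.remove(evicted.doc);  // keep the set in sync with the queue
        }
    }

    int size() { return pq.size(); }
    boolean contains(int doc) { return docsInQueue.contains(doc); }
}
```

Removing the evicted doc's ID on every overflow is what keeps the set
bounded by the queue size rather than by the total hit count.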
>> >
>> > On 3/29/07, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:
>> >>
>> >> Hm, removing duplicates (as determined by the value of a specified
>> >> document field) from the results would be nice.
>> >> How would your addition affect performance, considering it has to
>> >> check the PQ for a previous value for every candidate hit?
>> >>
>> >> Otis
>> >> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>> >> Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
>> >>
>> >> ----- Original Message ----
>> >> From: Peter Keegan <peterlkeegan@gmail.com>
>> >> To: java-user@lucene.apache.org
>> >> Sent: Thursday, March 29, 2007 9:39:13 AM
>> >> Subject: FieldSortedHitQueue enhancement
>> >>
>> >> This is a request for an enhancement to
>> >> FieldSortedHitQueue/PriorityQueue that would prevent duplicate
>> >> documents from being inserted, or alternatively, allow the
>> >> application to prevent this (reason explained below). I can do this
>> >> today by making the 'lessThan' method public and checking the queue
>> >> before inserting like this:
>> >>
>> >> if (hq.size() < maxSize) {
>> >>     // doc will be inserted into queue - check for duplicate before inserting
>> >> } else if (hq.size() > 0 && !hq.lessThan((ScoreDoc)fieldDoc, (ScoreDoc)hq.top())) {
>> >>     // doc will be inserted into queue - check for duplicate before inserting
>> >> } else {
>> >>     // doc will not be inserted - no check needed
>> >> }
>> >>
>> >> However, this just replicates existing code in
>> >> PriorityQueue->insert(). An alternative would be to have a method
>> >> like:
>> >>
>> >> public boolean wouldBeInserted(ScoreDoc doc)
>> >> // returns true if doc would be inserted, without inserting
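What the proposed wouldBeInserted() amounts to can be sketched against
java.util.PriorityQueue, with ints standing in for scored docs (the >=
mirrors the !lessThan(doc, top()) test in the snippet above; Lucene's own
queue would phrase the comparison via lessThan, and this is not its real
API):

```java
import java.util.PriorityQueue;

// Sketch only: the insertion test exposed without side effects. A bounded
// min-queue keeps a candidate if there is room, or if it is at least as
// good as the current least element on top.
class WouldInsert {
    static boolean wouldBeInserted(PriorityQueue<Integer> pq, int maxSize, int candidate) {
        if (pq.size() < maxSize) return true;   // queue not full: always inserted
        return candidate >= pq.peek();          // mirrors !lessThan(candidate, top())
    }
}
```

The point is that the caller can run the duplicate check only for
candidates where this returns true, exactly as in the hand-rolled version
above, but without duplicating insert()'s internal logic.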
>> >>
>> >> The reason for this is that I have some queries that get expanded
>> >> into multiple searches, and the resulting hits are OR'd together.
>> >> The queries contain 'terms' that are not seen by Lucene but are
>> >> handled by a HitCollector that uses external data for each document
>> >> to evaluate hits. The results from the priority queue should contain
>> >> no duplicate documents (first or last doc wins).
>> >>
>> >> Do any of these suggestions seem reasonable? So far, I've been able
>> >> to use Lucene without any modifications, and I hope to continue this
>> >> way.
>> >>
>> >> Peter


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

