lucene-java-user mailing list archives

From Antony Bowesman <...@teamware.com>
Subject Re: FieldSortedHitQueue enhancement
Date Thu, 29 Mar 2007 22:07:12 GMT
I've got a similar duplicate case, but my duplicates are based on an external ID 
rather than the doc ID, so they can occur within a single Query.  It uses a custom 
HitCollector that is score based, not field sorted.

If my duplicate has a higher score than the one on the PQ, I need to update the 
stored score with the higher one, so PQ needs a replace() method where the 
stored object's equals() can be used to find the object to delete.  I'm not sure 
if there's a way to find the object efficiently in this case other than a linear 
search.  I implemented remove().
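
A minimal sketch of the replace-on-higher-score idea described above. It does not use Lucene's PriorityQueue; java.util.PriorityQueue stands in for it, and the names DedupTopN, Hit, collect and scoreOf are hypothetical. A side map keyed by external ID finds the stored object without scanning, although java.util.PriorityQueue.remove(Object) itself is still a linear operation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch: keeps the top maxSize hits, at most one per external ID.
class DedupTopN {
    private static final class Hit {
        final String externalId;
        float score;
        Hit(String externalId, float score) {
            this.externalId = externalId;
            this.score = score;
        }
    }

    private final int maxSize;
    // Finds the stored hit for an external ID without scanning the queue.
    private final Map<String, Hit> byId = new HashMap<>();
    // Min-heap on score, so the lowest-scoring hit is at the head.
    private final PriorityQueue<Hit> pq =
            new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));

    DedupTopN(int maxSize) {
        this.maxSize = maxSize;
    }

    void collect(String externalId, float score) {
        Hit existing = byId.get(externalId);
        if (existing != null) {
            if (score > existing.score) {
                // Remove before changing the score so the heap stays ordered.
                pq.remove(existing);   // identity match; linear in queue size
                existing.score = score;
                pq.add(existing);
            }
            return;                    // duplicate handled either way
        }
        if (pq.size() < maxSize) {
            add(externalId, score);
        } else if (!pq.isEmpty() && score > pq.peek().score) {
            Hit evicted = pq.poll();   // drop the current lowest score
            byId.remove(evicted.externalId);
            add(externalId, score);
        }
    }

    private void add(String externalId, float score) {
        Hit hit = new Hit(externalId, score);
        byId.put(externalId, hit);
        pq.add(hit);
    }

    // Returns the stored score for an ID, or null if it is not in the queue.
    Float scoreOf(String externalId) {
        Hit hit = byId.get(externalId);
        return hit == null ? null : hit.score;
    }

    int size() {
        return pq.size();
    }
}
```

The map never holds more entries than the queue, so the extra memory is bounded by the number of hits kept, not by the number of hits collected.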

Peter, how did you achieve 'last wins' as you must presumably remove first from 
the PQ?

Antony


Peter Keegan wrote:
> The duplicate check would just be on the doc ID. I'm using TreeSet to detect
> duplicates with no noticeable effect on performance. The PQ only has to be
> checked for a previous value IFF the element about to be inserted is
> actually inserted and not dropped because it's less than the least value
> already in there. So, the TreeSet is never bigger than the size of the PQ
> (typically 25 to a few hundred items), not the size of all hits.
> 
> Peter
> 
> On 3/29/07, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:
>>
>> Hm, removing duplicates (as determined by a value of a specified document
>> field) from the results would be nice.
>> How would your addition affect performance, considering it has to check
>> the PQ for a previous value for every candidate hit?
>>
>> Otis
>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>> Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
>>
>> ----- Original Message ----
>> From: Peter Keegan <peterlkeegan@gmail.com>
>> To: java-user@lucene.apache.org
>> Sent: Thursday, March 29, 2007 9:39:13 AM
>> Subject: FieldSortedHitQueue enhancement
>>
>> This is a request for an enhancement to FieldSortedHitQueue/PriorityQueue that
>> would prevent duplicate documents from being inserted, or alternatively,
>> allow the application to prevent this (reason explained below). I can do
>> this today by making the 'lessThan' method public and checking the queue
>> before inserting like this:
>>
>> if (hq.size() < maxSize) {
>>   // doc will be inserted into queue - check for duplicate before inserting
>> } else if (hq.size() > 0 && !hq.lessThan((ScoreDoc)fieldDoc, (ScoreDoc)hq.top())) {
>>   // doc will be inserted into queue - check for duplicate before inserting
>> } else {
>>   // doc will not be inserted - no check needed
>> }
>>
>> However, this is just replicating existing code in PriorityQueue->insert().
>> An alternative would be to have a method like:
>>
>> public boolean wouldBeInserted(ScoreDoc doc)
>> // returns true if doc would be inserted, without inserting
>>
>> The reason for this is that I have some queries that get expanded into
>> multiple searches and the resulting hits are OR'd together. The queries
>> contain 'terms' that are not seen by Lucene but are handled by a
>> HitCollector that uses external data for each document to evaluate hits.
>> The results from the priority queue should contain no duplicate documents
>> (first or last doc wins).
>>
>> Do any of these suggestions seem reasonable? So far, I've been able to use
>> Lucene without any modifications, and hope to continue this way.
>>
>> Peter
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
> 
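
The wouldBeInserted() check and the TreeSet duplicate detection discussed in the quoted messages could be combined roughly as follows. This is only a sketch under stated assumptions: java.util.PriorityQueue stands in for Lucene's PriorityQueue, hits are ordered by score only, "first doc wins" is the chosen policy, and the names BoundedDedupQueue, Entry, insert and trackedDocs are hypothetical. Only wouldBeInserted comes from the thread, and its signature here takes a score instead of a ScoreDoc.

```java
import java.util.PriorityQueue;
import java.util.TreeSet;

// Hypothetical sketch: bounded top-N queue that rejects duplicate doc IDs.
class BoundedDedupQueue {
    private static final class Entry {
        final int doc;
        final float score;
        Entry(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    private final int maxSize;
    // Min-heap on score: the head is the least value currently kept.
    private final PriorityQueue<Entry> heap =
            new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
    // Doc IDs currently in the heap; never larger than the heap itself.
    private final TreeSet<Integer> docsInQueue = new TreeSet<>();

    BoundedDedupQueue(int maxSize) {
        this.maxSize = maxSize;
    }

    // Returns true if a hit with this score would be kept, without inserting.
    boolean wouldBeInserted(float score) {
        if (heap.size() < maxSize) {
            return true;
        }
        // Mirrors the quoted check: keep the hit if it is not less than the least.
        return heap.size() > 0 && score >= heap.peek().score;
    }

    // Inserts the hit unless it would be dropped or the doc is already queued.
    boolean insert(int doc, float score) {
        if (!wouldBeInserted(score)) {
            return false;              // dropped: no duplicate check needed
        }
        if (!docsInQueue.add(doc)) {
            return false;              // duplicate: the first doc wins
        }
        if (heap.size() == maxSize) {
            Entry evicted = heap.poll();
            docsInQueue.remove(evicted.doc);
        }
        heap.add(new Entry(doc, score));
        return true;
    }

    int size() {
        return heap.size();
    }

    int trackedDocs() {
        return docsInQueue.size();
    }
}
```

Because the set is consulted only for hits that pass wouldBeInserted(), and evicted docs are removed from it, its size stays bounded by the queue size and not by the total number of hits.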

