lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: FieldSortedHitQueue enhancement
Date Thu, 29 Mar 2007 14:48:00 GMT
Hm, removing duplicates (as determined by a value of a specified document field) from the results
would be nice.
How would your addition affect performance, considering it has to check the PQ for a previous
value for every candidate hit?

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

----- Original Message ----
From: Peter Keegan <peterlkeegan@gmail.com>
To: java-user@lucene.apache.org
Sent: Thursday, March 29, 2007 9:39:13 AM
Subject: FieldSortedHitQueue enhancement

This is request for an enhancement to FieldSortedHitQueue/PriorityQueue that
would prevent duplicate documents from being inserted, or alternatively,
allow the application to prevent this (reason explained below). I can do
this today by making the 'lessThan' method public and checking the queue
before inserting like this:

if (hq.size() < maxSize) {
   // doc will be inserted into queue - check for duplicate before inserting
} else if (hq.size() > 0 && !hq.lessThan((ScoreDoc)fieldDoc,
(ScoreDoc)hq.top()) {
  // doc will be inserted into queue - check for duplicate before inserting
} else {
  // doc will not be inserted - no check needed
}

However, this is just replicating existing code in PriorityQueue->insert().
An alternative would be to have a method like:

public boolean wouldBeInserted(ScoreDoc doc)
// returns true if doc would be inserted, without inserting

The reason for this is that I have some queries that get expanded into
multiple searches and the resulting hits are OR'd together. The queries
contain 'terms' that are not seen by Lucene but are handled by a
HitCollector that uses external data for each document to evaluate hits. The
results from the priority queue should contain no duplicate documents (first
or last doc wins).

Do any of these suggestions seem reasonable?. So far, I've been able to use
Lucene without any modifications, and hope to continue this way.

Peter




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message