lucene-java-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: removing duplicate Documents from Hits
Date Fri, 01 Oct 2004 19:21:13 GMT
Timm, Andy (ETW) wrote:
> Hello, I've searched previous posts on this topic but couldn't find an answer.  I
> want to query my index (which is built from a number of 'flattened' Oracle tables) for some
> criteria, then return Hits such that no two Documents duplicate a particular field.  In the
> case where table A has a one-to-many relationship to table B, I get one Document for each
> pair (A1-B1, A1-B2, A1-B3...).  My index needs to have each of these records, as 'B' is a
> searchable field in the index.  However, after the query is executed, I want my resulting
> Hits to be unique on 'A'.  I'm only returning the Oracle object ID, so once I've seen it
> once I don't need it again.  It looks like some sort of custom Filter is in order.

I'd suggest a HitCollector that uses a FieldCache of the "A" values to 
check for duplicates, and collects only the best document id for each 
value of "A".  This would use a bit of RAM, but be very fast.

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/HitCollector.html
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/FieldCache.html
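
As a rough illustration of the idea above, here is a minimal sketch with no
Lucene dependency.  The class name BestPerKeyCollector and its bestDocs()
method are made up for this example.  The collect(int doc, float score)
method mirrors the shape of HitCollector.collect, and the String[] passed to
the constructor stands in for the per-document array that
FieldCache.DEFAULT.getStrings(reader, "A") would return.  It also assumes
that "best" means highest-scoring, which the message does not spell out.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Keeps only the highest-scoring document id for each value of field "A".
 * Hypothetical stand-in: in real code this logic would live in a subclass of
 * org.apache.lucene.search.HitCollector, and keyByDoc would come from
 * FieldCache.DEFAULT.getStrings(reader, "A").
 */
class BestPerKeyCollector {
    // keyByDoc[doc] is the value of field "A" for document id doc.
    private final String[] keyByDoc;
    // Best document id and its score seen so far, per "A" value.
    private final Map<String, Integer> bestDoc = new HashMap<>();
    private final Map<String, Float> bestScore = new HashMap<>();

    BestPerKeyCollector(String[] keyByDoc) {
        this.keyByDoc = keyByDoc;
    }

    /** Called once per matching document, like HitCollector.collect. */
    void collect(int doc, float score) {
        String key = keyByDoc[doc];
        Float previous = bestScore.get(key);
        // Keep this document only if its "A" value is new or it scores higher.
        if (previous == null || score > previous) {
            bestScore.put(key, score);
            bestDoc.put(key, doc);
        }
    }

    /** Map from each "A" value to the id of its best-scoring document. */
    Map<String, Integer> bestDocs() {
        return bestDoc;
    }

    public static void main(String[] args) {
        // Documents 0..2 share A1; document 3 has A2.
        BestPerKeyCollector collector =
            new BestPerKeyCollector(new String[] {"A1", "A1", "A1", "A2"});
        collector.collect(0, 0.2f);
        collector.collect(1, 0.9f);
        collector.collect(2, 0.5f);
        collector.collect(3, 0.1f);
        System.out.println(collector.bestDocs());
    }
}
```

The RAM cost mentioned above comes from the per-document array of "A" values
plus one map entry per distinct value; each collect call is a single array
lookup and a hash lookup.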

Doug


