lucene-java-user mailing list archives

From Doug Cutting
Subject Re: removing duplicate Documents from Hits
Date Fri, 01 Oct 2004 19:21:13 GMT
Timm, Andy (ETW) wrote:
> Hello, I've searched previous posts on this topic but couldn't find an answer.  I
> want to query my index (which is a number of 'flattened' Oracle tables) for some criteria,
> then return Hits such that no Documents duplicate a particular field.  In the
> case where table A has a one-to-many relationship to table B, I get one Document for each
> pair (A1-B1, A1-B2, A1-B3...).  My index needs to have each of these records, as 'B' is a searchable
> field in the index.  However, after the query is executed, I want my resulting Hits to be
> unique on 'A'.  I'm only returning the Oracle object ID, so once I've seen it once I don't
> need it again.  It looks like some sort of custom Filter is in order.

I'd suggest a HitCollector that uses a FieldCache of the "A" values to 
check for duplicates, and collects only the best document id for each 
value of "A".  This would use a bit of RAM, but be very fast.
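
A minimal sketch of that idea in plain Java, with no Lucene dependency so it
can be run on its own.  The class name BestPerValueCollector and the method
bestDocs() are hypothetical, and "best" is assumed here to mean the
highest-scoring document.  The String[] stands in for the cached "A" values
indexed by document id; in Lucene itself that array would presumably come
from the FieldCache (for example FieldCache.DEFAULT.getStrings(reader, "A"))
and the class would extend HitCollector, whose collect(int doc, float score)
callback this mirrors.  Check both against your Lucene version.

```java
import java.util.HashMap;
import java.util.Map;

/** Keeps only the highest-scoring document id for each distinct field value. */
public class BestPerValueCollector {
    // Stand-in for the FieldCache array: fieldValues[doc] is the "A" value of doc.
    private final String[] fieldValues;
    private final Map<String, Integer> bestDoc = new HashMap<>();
    private final Map<String, Float> bestScore = new HashMap<>();

    public BestPerValueCollector(String[] fieldValues) {
        this.fieldValues = fieldValues;
    }

    /** Same shape as a hit-collector callback: called once per matching document. */
    public void collect(int doc, float score) {
        String key = fieldValues[doc];
        if (key == null) {
            return; // document has no value for the field; skip it
        }
        Float previous = bestScore.get(key);
        if (previous == null || score > previous) {
            // First hit for this value, or a better one than the hit seen so far.
            bestScore.put(key, score);
            bestDoc.put(key, doc);
        }
    }

    /** Maps each distinct field value to the id of its best-scoring document. */
    public Map<String, Integer> bestDocs() {
        return bestDoc;
    }
}
```

The RAM cost is the one array entry per document in the index plus one map
entry per distinct "A" value that matched; the per-hit work is a single array
lookup and a map update, with no stored-field reads.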

