lucene-java-user mailing list archives

From Morus Walter <morus.wal...@tanto-xipolis.de>
Subject Re: Merging indexes and removing duplicates.
Date Fri, 09 May 2003 07:01:52 GMT
Victor Hadianto writes:
> > Another simple solution would be to just accept duplicates in index, but
> > remove them from the results before returning result set.
> > This should work ok as long as there is a unique field to use for weeding
> > out dups, and if number of duplicates is reasonably low.
> 
> Yes we do have a unique field to identify the document, so I can check if 
> there are two documents with the same id in the result set then they are 
> duplicates and just return the unique documents. This may work.
> 
Well, you could also go through the index once after merging and delete
all duplicates.
The API allows you to loop over all values of a given field and over all
documents for a given value. So I'd just loop over all values of the unique
field and, for each value, delete all but the first document.
It's an extra step, but it should be fast (especially since
deleting here simply means marking the document as deleted).
If you optimize the index as a last step, the deleted documents will
be physically removed from the index.
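
A rough sketch of that loop, written against a toy in-memory stand-in rather than
the real Lucene classes so that it runs on its own. ToyIndex, postings(),
deleteDuplicates() and optimize() are all made-up names for illustration; with
Lucene itself the same steps would presumably go through the term and
term-document enumerations on IndexReader and its delete call.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class DedupSketch {

    /** Hypothetical toy index: documents are numbered in insertion order. */
    static final class ToyIndex {
        final List<String> ids = new ArrayList<>(); // value of the unique field per document
        final BitSet deleted = new BitSet();        // "marked as deleted" flags

        int add(String id) {
            ids.add(id);
            return ids.size() - 1;
        }

        /** Field value -> live document numbers, in ascending order. */
        Map<String, List<Integer>> postings() {
            Map<String, List<Integer>> result = new TreeMap<>();
            for (int doc = 0; doc < ids.size(); doc++) {
                if (!deleted.get(doc)) {
                    result.computeIfAbsent(ids.get(doc), k -> new ArrayList<>()).add(doc);
                }
            }
            return result;
        }

        /** Stand-in for optimizing: copy only the documents not marked as deleted. */
        ToyIndex optimize() {
            ToyIndex compact = new ToyIndex();
            for (int doc = 0; doc < ids.size(); doc++) {
                if (!deleted.get(doc)) {
                    compact.add(ids.get(doc));
                }
            }
            return compact;
        }
    }

    /** For every value, keep the first document and mark the rest as deleted. */
    static int deleteDuplicates(ToyIndex index) {
        int marked = 0;
        for (List<Integer> docs : index.postings().values()) {
            for (int i = 1; i < docs.size(); i++) { // start at 1: the first document stays
                index.deleted.set(docs.get(i));
                marked++;
            }
        }
        return marked;
    }

    public static void main(String[] args) {
        ToyIndex merged = new ToyIndex();
        for (String id : new String[] {"a", "b", "a", "c", "b", "a"}) {
            merged.add(id);
        }
        int marked = deleteDuplicates(merged);
        ToyIndex compact = merged.optimize();
        System.out.println(marked + " marked, " + compact.ids.size() + " left: " + compact.ids);
    }
}
```

Note that marking is cheap because nothing is moved; the documents only go away
in the optimize step, which is why it makes sense to do that last.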
