lucene-java-user mailing list archives

From Tatu Saloranta
Subject Re: Merging indexes and removing duplicates.
Date Fri, 09 May 2003 03:18:56 GMT
On Thursday 08 May 2003 20:14, David Medinets wrote:
> Out of curiosity, is there some unique identifier for each document? Even
> so, it seems like you need a central database (of some type) to handle
> determining if a given document is indexed. If Lucene does the check, won't
> it need to ask 15 different machines for the 'IsIndexedAlready' answer?
> I'd use a servlet as a facade to a static hashmap (a singleton, of course)
> which would act as a clearinghouse. You'd need two servlets with access to
> the same hashmap. The first servlet inserts the document into the hashmap
> and the second servlet handles the query.

Another simple solution would be to just accept duplicates in the index, but
remove them from the results before returning the result set.
This should work OK as long as there is a unique field to use for weeding out
dups, and as long as the number of duplicates is reasonably low.
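
A minimal sketch of that filtering step, in plain Java rather than against the
Lucene API. The Hit class, its uniqueId field and the dropDuplicates method are
hypothetical names for illustration; the sketch assumes the hits have already
been read out of the search results in rank order, so the first occurrence of
each id is the best-scoring one.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: filters an already-ranked hit list.
final class ResultDeduper {

    /** A search hit reduced to the unique field's value and its score. */
    static final class Hit {
        public final String uniqueId;
        public final float score;

        public Hit(String uniqueId, float score) {
            this.uniqueId = uniqueId;
            this.score = score;
        }
    }

    /**
     * Keeps the first hit seen for each unique id and drops later ones.
     * Input order is preserved, so ranked input stays ranked.
     */
    public static List<Hit> dropDuplicates(List<Hit> rankedHits) {
        Set<String> seenIds = new HashSet<>();
        List<Hit> unique = new ArrayList<>();
        for (Hit hit : rankedHits) {
            // Set.add returns false when the id was already present.
            if (seenIds.add(hit.uniqueId)) {
                unique.add(hit);
            }
        }
        return unique;
    }
}
```

The set only holds ids for one result list at a time, so the cost is paid per
query rather than per indexed document. With many duplicates, a page of N hits
may shrink well below N, which is why a low duplicate rate matters here.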

Also, if the work is originally split among workers by a single controller entity,
that entity might be able to check for duplicates reasonably efficiently...
but it sounded like the sheer number of documents to handle leads to a huge number
of ids to store for checking duplicates.
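
For illustration, the controller-side check could be as small as the sketch
below. SeenIdRegistry and claim are hypothetical names, and the sketch assumes
all ids fit in the controller's memory, which is exactly the assumption that
may not hold at this scale.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical controller-side registry; every id seen so far stays in memory.
final class SeenIdRegistry {

    // Concurrent set, so several dispatcher threads can claim ids safely.
    private final Set<String> seenIds = ConcurrentHashMap.newKeySet();

    /** Returns true the first time an id is offered, false for any repeat. */
    public boolean claim(String documentId) {
        return seenIds.add(documentId);
    }

    /** Number of distinct ids recorded so far. */
    public int size() {
        return seenIds.size();
    }
}
```

The controller would hand a document to a worker only when claim returns true.
Memory grows linearly with the number of distinct ids, so this only helps while
that set remains manageable.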

-+ Tatu +-
