From: Victor Hadianto
Organization: NUIX Pty. Ltd.
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Subject: Re: Merging indexes and removing duplicates.
Date: Fri, 9 May 2003 14:00:18 +1000
References: <200305091052.40144@bah> <05c101c315d0$ba51a6c0$6422a8c0@medined01> <200305082118.56501.tatu@hypermall.net>
In-Reply-To: <200305082118.56501.tatu@hypermall.net>
Message-Id: <200305091400.18982@bah>

> Another simple solution would be to just accept duplicates in the index,
> but remove them from the results before returning the result set.
> This should work OK as long as there is a unique field to use for weeding
> out dups, and if the number of duplicates is reasonably low.

Yes, we do have a unique field that identifies each document, so if two
documents in the result set carry the same id they are duplicates, and I can
return only the unique documents. This may work. The only issue is that as
the number of "workers" increases, the number of duplicates increases as
well. Having said that, we probably won't have more than 16 workers anyway.

> Also, if the work is originally split among workers by a single controller
> entity, that entity might be able to check for duplicates reasonably
> efficiently... but it sounded like the sheer number of documents to handle
> leads to a huge number of ids to store for duplicate checking.

Hmm, unfortunately this won't work because of the nature of the system. Each
worker doesn't know what the others are working on, and there is no single
entity that controls the distribution of work to the workers. To make things
more complicated, each worker can be at a different stage (not only doing
Lucene indexing).

victor
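For illustration, a minimal sketch of the post-search de-duplication described
above, written against the Lucene 1.x Hits API of the time. The field name
"uid" and the helper class name are placeholders for whatever unique field the
index actually stores, not anything taken from the original setup:

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;

/**
 * Collapses hits that share the same stored unique id, keeping only
 * the first (highest-ranked) copy of each document.
 */
public class HitDeduper {

    public static List dedupe(Hits hits, String uidField) throws IOException {
        Set seen = new HashSet();       // ids already emitted
        List unique = new ArrayList();  // de-duplicated documents, in rank order

        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            String uid = doc.get(uidField);
            // Set.add() returns false when the id has been seen before,
            // so later duplicates are silently skipped.
            if (uid == null || seen.add(uid)) {
                unique.add(doc);
            }
        }
        return unique;
    }
}

Something like HitDeduper.dedupe(searcher.search(query), "uid") would then do
the weeding in one pass over the hits, so the extra cost grows only with the
number of duplicates actually returned, which should stay small with at most
16 workers.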