From: Victor Hadianto
Organization: NUIX Pty. Ltd.
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Subject: Re: Merging indexes and removing duplicates.
Date: Fri, 9 May 2003 14:00:18 +1000
References: <200305091052.40144@bah> <05c101c315d0$ba51a6c0$6422a8c0@medined01> <200305082118.56501.tatu@hypermall.net>
In-Reply-To: <200305082118.56501.tatu@hypermall.net>
Message-Id: <200305091400.18982@bah>

> Another simple solution would be to just accept duplicates in the index,
> but remove them from the results before returning the result set.
> This should work OK as long as there is a unique field to use for weeding
> out dups, and if the number of duplicates is reasonably low.

Yes, we do have a unique field that identifies each document, so if two
documents in the result set carry the same id they are duplicates, and I can
return only the unique documents. This may work. The only issue is that as
the number of "workers" increases, the number of duplicates increases as
well. Having said that, we probably won't have more than 16 workers anyway.

> Also, if the work is originally split among workers by a single controller
> entity, that entity might be able to check for duplicates reasonably
> efficiently... but it sounded like the sheer number of documents to handle
> leads to a huge number of ids to store for duplicate checking.

Hmm, unfortunately this won't work because of the nature of the system. Each
worker doesn't know what the others are working on, and there is no single
entity that controls the distribution of work to the workers. To make things
more complicated, each worker can be at a different stage (not only doing
Lucene indexing).

victor
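For illustration, a minimal sketch of the post-search de-duplication described
above, written against the Lucene 1.x Hits API of the time. The field name
"uid" and the helper class name are placeholders for whatever unique field the
index actually stores, not anything taken from the original setup:

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;

/**
 * Collapses hits that share the same stored unique id, keeping only
 * the first (highest-ranked) copy of each document.
 */
public class HitDeduper {

    public static List dedupe(Hits hits, String uidField) throws IOException {
        Set seen = new HashSet();       // ids already emitted
        List unique = new ArrayList();  // de-duplicated documents, in rank order

        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            String uid = doc.get(uidField);
            // Set.add() returns false when the id has been seen before,
            // so later duplicates are silently skipped.
            if (uid == null || seen.add(uid)) {
                unique.add(doc);
            }
        }
        return unique;
    }
}

Something like HitDeduper.dedupe(searcher.search(query), "uid") would then do
the weeding in one pass over the hits, so the extra cost grows only with the
number of duplicates actually returned, which should stay small with at most
16 workers.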