Date: Wed, 9 Mar 2005 14:57:10 -0500
From: Yonik Seeley
To: java-user@lucene.apache.org
Subject: Re: Best Practices for Distributing Lucene
 Indexing and Searching
In-Reply-To: <422F43DF.9040604@apache.org>
References: <1109713160.18862.109.camel@localhost> <42254963.6000901@apache.org> <422F43DF.9040604@apache.org>

I'm trying to support an interface where documents can be added one at a
time at a high rate (via HTTP POST). You don't know all of the documents
ahead of time, so you can't delete them all ahead of time.

Given this constraint, it seems like you can do one of three things:

1) Collect all the documents to be added, without actually adding them.
   Then you know the complete list and can do the deletes before the adds.
2) Tag all the documents as you add them so you can tell old from new.
3) Depend on some special ordering that may exist in a Lucene index
   (see the Big Questions below).

For the 2nd approach to work with duplicates in the same group (doc A
added twice before the IndexWriter is closed), it looks like you would
have to keep track of what you tagged each individual document with.
After the IndexWriter has closed, you could use a term enumerator to go
through every document you added and delete anything but the latest
(but finding which Lucene docid corresponds to which version is more
work still...)

Big Big Question: Will a term enumerator enumerate in the order
documents were added to the index (for a single term of, say, id:a)?
If so, there would be no need to tag at all - simply enumerate and
delete all but the last.

Another Big Big Question: If the former idea doesn't work, can we depend
on the ordering of the docids? Will docs added later always have higher
internal docids than ones added earlier?

-Yonik

On Wed, 09 Mar 2005 10:43:43 -0800, Doug Cutting wrote:
> Yonik Seeley wrote:
> > This strategy looks very promising.
> >
> > One drawback is that documents must be added directly to the main
> > index for this to be efficient. This is a bit of a problem if there
> > is a document uniqueness requirement (a unique id field).
>
> This is easy to do with a single index. Here's the loop:
>
>    1. Poll DB for updated and new documents.
>    2. Delete all updated docs from an IndexReader & close it.
>    3. Add all new & updated to an IndexWriter & close it.
>    4. Tell DB that documents are updated.
>    5. Checkpoint index.
>    6. Repeat.
>
> Deleting is much faster than adding.
>
> Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
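
For illustration only, option 1 in the message above (buffer the posted
documents, then do the deletes before the adds) might be sketched roughly
as follows. This is plain Java with an in-memory map standing in for the
buffer. The class and method names (PendingBatch, post, idsToDelete,
docsToAdd) are hypothetical, and no Lucene API is called. It assumes each
document carries a unique id string and that only the latest version
posted within a batch should survive.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical buffer for option 1; a stand-in sketch, not Lucene code. */
final class PendingBatch {
    // Latest body seen for each unique id, ordered by most recent arrival.
    private final Map<String, String> latestById = new LinkedHashMap<>();

    /** Buffer a posted document instead of adding it to the index right away. */
    void post(String id, String body) {
        latestById.remove(id);    // a repost replaces the earlier version
        latestById.put(id, body); // and moves the id to the end of the order
    }

    /** Ids whose old versions should be deleted before any add happens. */
    List<String> idsToDelete() {
        return new ArrayList<>(latestById.keySet());
    }

    /** Documents to add afterwards: exactly one per id, the latest version. */
    Map<String, String> docsToAdd() {
        return new LinkedHashMap<>(latestById);
    }

    public static void main(String[] args) {
        PendingBatch batch = new PendingBatch();
        batch.post("a", "first version of a");
        batch.post("b", "only version of b");
        batch.post("a", "second version of a"); // duplicate within the batch
        System.out.println("delete first: " + batch.idsToDelete());
        System.out.println("then add:     " + batch.docsToAdd());
    }
}
```

In a real setup the returned ids would presumably be deleted through an
IndexReader and the buffered documents then added through an IndexWriter,
as in the quoted loop. Because duplicates are collapsed in the buffer, no
tagging is needed, at the cost of holding the whole batch in memory until
it is flushed.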