Message-ID: <c68e391705030911579311055@mail.gmail.com>
Date: Wed, 9 Mar 2005 14:57:10 -0500
From: Yonik Seeley <yseeley@gmail.com>
Reply-To: Yonik Seeley <yseeley@gmail.com>
To: java-user@lucene.apache.org
Subject: Re: Best Practices for Distributing Lucene Indexing and Searching
In-Reply-To: <422F43DF.9040604@apache.org>
References: <1109713160.18862.109.camel@localhost>
	 <c68e391705030119457d39b3d8@mail.gmail.com>
	 <42254963.6000901@apache.org>
	 <c68e39170503090836559d334@mail.gmail.com>
	 <422F43DF.9040604@apache.org>

I'm trying to support an interface where documents can be added one at
a time at a high rate (via HTTP POST).  You don't know all of the
documents ahead of time, so you can't delete them all ahead of time.

Given this constraint, it seems like you can do one of three things:
1) collect all the documents to be added, without actually adding
them.  Then you know the complete list and can do the deletes before
the adds.
2) tag all the documents as you add them so you can tell old from new.
3) depend on some special ordering that may exist in a Lucene index
(see the Big Big Questions below)
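
To make option 1 concrete, here is a minimal sketch of the buffering
step. It uses plain collections rather than Lucene calls; the class
name PendingBatch and its methods are made up for illustration, and
documents are reduced to an id plus a body string.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical buffer for option 1: hold incoming documents in memory,
// keyed by their unique id, until the batch is flushed to the index.
class PendingBatch {
    // A later add with the same id replaces the earlier one.
    private final Map<String, String> pending = new LinkedHashMap<>();

    void add(String id, String body) {
        pending.remove(id);   // a re-added id moves to the end of the order
        pending.put(id, body);
    }

    // Ids to delete from the existing index before any adds happen.
    List<String> idsToDelete() {
        return new ArrayList<>(pending.keySet());
    }

    // Documents to add after the deletes: one per id, the latest version.
    Map<String, String> docsToAdd() {
        return new LinkedHashMap<>(pending);
    }
}
```

At flush time you would delete everything in idsToDelete() and then add
everything in docsToAdd(). Duplicates within the batch never reach the
index, at the cost of holding the whole batch in memory.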

For the 2nd approach to work with duplicates in the same group (doc A
added twice before the IndexWriter is closed), it looks like you would
have to keep track of what you tagged each individual document with.
After the IndexWriter has closed, you could use a term enumerator to
go through every document you added and delete anything but the latest
(though working out which Lucene docid corresponds to which version is
more work still...)
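
A rough sketch of the bookkeeping option 2 would need, again without
any Lucene calls. VersionTagger is a hypothetical name, and the sketch
assumes each tag is stored in a field on the document and that tags
are never reused across batches (hence the firstTag argument).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical bookkeeping for option 2: hand out a unique tag for every
// added document and remember the most recent tag given to each id.
class VersionTagger {
    private long nextTag;
    private final Map<String, Long> latestTagById = new HashMap<>();

    // firstTag must be higher than any tag used in earlier batches.
    VersionTagger(long firstTag) {
        this.nextTag = firstTag;
    }

    // Returns the tag to store on the document that is about to be added.
    long tagFor(String id) {
        long tag = nextTag++;
        latestTagById.put(id, tag);
        return tag;
    }

    // True if an indexed document with this id and tag has been superseded
    // by a later add in this batch. Ids not seen in this batch are left alone.
    boolean isStale(String id, long tag) {
        Long latest = latestTagById.get(id);
        return latest != null && tag != latest;
    }
}
```

After the IndexWriter is closed, every indexed document whose id and
tag make isStale() return true would be a candidate for deletion.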

Big Big Question:
Will a term enumerator enumerate in the order documents were added to
the index (for a single term of say id:a)?  If so, there would be no
need to tag at all - simply enumerate and delete all but the last.

Another Big Big Question:
If the former idea doesn't work, can we depend on the ordering of the
docids?  Will docs added later always have higher internal docids than
ones added earlier?
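
If the answer to that second question is yes, the cleanup becomes
simple. The sketch below is only valid under that assumption (later
adds always get higher docids), which is exactly what I'm asking
about. DuplicatePruner is a made-up name, and the input stands for
the docids that match a single id term.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: given every docid that matches one unique-id term,
// pick the docids to delete. ASSUMES a higher docid means a later add.
class DuplicatePruner {
    static List<Integer> staleDocIds(int[] matchingDocIds) {
        List<Integer> stale = new ArrayList<>();
        if (matchingDocIds.length == 0) {
            return stale;
        }
        // Find the highest docid, treated here as the newest version.
        int newest = matchingDocIds[0];
        for (int docId : matchingDocIds) {
            newest = Math.max(newest, docId);
        }
        // Everything else for this id is an older duplicate.
        for (int docId : matchingDocIds) {
            if (docId != newest) {
                stale.add(docId);
            }
        }
        return stale;
    }
}
```

Because it takes the maximum rather than the last element, this does
not also depend on the enumeration order from the first question.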

-Yonik


On Wed, 09 Mar 2005 10:43:43 -0800, Doug Cutting <cutting@apache.org> wrote:
> Yonik Seeley wrote:
> > This strategy looks very promising.
> >
> > One drawback is that documents must be added directly to the main
> > index for this to be efficient.  This is a bit of a problem if there
> > is a document uniqueness requirement (a unique id field).
> 
> This is easy to do with a single index.  Here's the loop:
> 
>   1. Poll DB for updated and new documents.
>   2. Delete all updated docs from an IndexReader & close it.
>   3. Add all new & updated to an IndexWriter & close it.
>   4. Tell DB that documents are updated.
>   5. Checkpoint index.
>   6. Repeat.
> 
> Deleting is much faster than adding.
> 
> Doug
