lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <>
Subject Re: Preventing merging by IndexWriter
Date Tue, 17 Oct 2006 15:15:18 GMT
Ignore the bit about keeping the mappings, it's too tricky unless really
really necessary, since by virtue of updating the meta-data document, you'll
delete a document, thus perhaps changing the Lucene IDs.

I should proofread before hitting the "send" button <G>...


On 10/17/06, Erick Erickson <> wrote:
> Why go through all this effort when it's easy to make your own unique ID?
> Add a new field to each document "myuniqueid" and fill it in yourself. It'll
> never change then.
> The complex coordination way.
> To coordinate things, you could keep the last ID used (and maybe other
> information) in a unique doc. Remember that Lucene doesn't require that
> every doc has the same fields. So it's possible to keep meta-data in a
> document without ever getting that document confused with your "real"
> documents. You could even keep the lucene docID <-> your new ID mapping in
> this document if it would help. Imagine a doc like this...
> metadata (contains some dummy string to make finding this doc easy).
> lastuniqueid (contains the last document ID you assigned).
> mapping(contains, say, a bunch of strings of the form ####:@@@@ where the
> #### is the Lucene ID and @@@@ is your id) I suspect you don't really need
> this.
> Whenever you add documents to your index, read the meta-data document, use
> lastunieuqid + 1 for your new document, and write the meta-data document
> back.
> The easy coordination:
> Alternatively, at the beginning of an indexing run, just use a TermEnum to
> find the maximum already-used uniqueid and assign incremental IDs to new
> documents. I was surprised at how very quickly you can enumerate a field in
> a Lucene index. Look around at the various TermEnum classes to see if you
> can easily find the largest one rather than iterating all of the terms. This
> may not be a good idea if you're indexing small chunks often....
> Of course, I may misunderstand your problem space enough that this is
> useless. If so, please tell us the problem you're trying to solve and maybe
> wiser heads than mine will have better suggestions....
> Erick
> On 10/17/06, Johan Stuyts <> wrote:
> >
> > Hi,
> >
> > (I am using Lucene 2.0.0)
> >
> > I have been looking at a way to use stable IDs with Lucene. The reason I
> > want this is so I can efficiently store and retrieve information outside
> > of Lucene for filtering search results. It looks like this is going to
> > require most of Lucene to be rewritten, so I gave up on that approach.
> >
> > I have a new idea where I want the documents IDs to only change at a
> > specific moment instead of whenever Lucene choses to do so. This way the
> >
> > document IDs remain stable and I can use these IDs in the external data.
> > I want to merge the segments of the index at a specific moment because
> > updating the external data to match the new document IDs is too
> > expensive to do continuously. At the moment that I want to merge the
> > segments of the index causing the document IDs to change, I can also
> > update my external data so the correct data is attached to the correct
> > Lucene document ID. If I understand correctly, merging only shifts
> > document IDs to remove deleted document IDs, so I can do the same
> > shifting with the external data by getting the set of deleted documents
> > before the merge.
> >
> > I already set 'mergeFactor' and 'maxBufferedDocs' to very high values so
> >
> > all documents of a batch will be stored in RAM. The problem I am facing
> > is that the IndexWriter merges the segments in RAM with the segments on
> > disk when I close the IndexWriter. What I need instead is that the
> > IndexWriter will create a new segment on disk containing the data from
> > the segment(s) in RAM. This way the document IDs of the exising disk
> > segments are not affected.
> >
> > Creating a new segment instead of merging with the existing ones will
> > also cause lots of segments with a variable number of documents to be
> > created on disk, but I believe the IndexReader/IndexSearcher is able to
> > handle this. I only have to make sure that the number of segments does
> > not become to high (i.e. merge regularly) because this might cause 'too
> > many open files' errors.
> >
> > So my questions are: is there a way to prevent the IndexWriter from
> > merging, forcing it to create a new segment for each indexing batch? And
> >
> > if so, will it still be possible to merge the disk segments when I want
> > to?
> >
> > Kind regards,
> >
> > Johan Stuyts
> > Hippo
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail:
> > For additional commands, e-mail:
> >
> >

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message