lucene-java-user mailing list archives

From "Erick Erickson" <>
Subject Re: inserting millions of entries
Date Thu, 28 Jun 2007 18:36:26 GMT
Yes, opening/closing will be very costly. But I *believe*, although I
haven't tried it, that IndexModifier (2.1) will work for you.

But do NOT take my word for it, as I haven't tried to do what you're doing.
It should be easy to write a short test or two to prove that you can
find recently inserted documents.
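
Separately from IndexModifier: if the duplicates you care about are exact
matches on some normalized key, one way to avoid reopening a reader after
every add is to check the already-open main index plus an in-memory set of
keys added during the current batch. The sketch below is plain Java with no
Lucene calls. The class name, the normalize() rule, and the Predicate standing
in for the lookup against the main index are all assumptions for illustration,
and it will not help with fuzzy duplicate detection, which still needs real
queries against up-to-date readers. It also assumes the keys in the main index
were normalized the same way, and it is written for Java 8 or later.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.Predicate;

/**
 * Sketch: duplicate check that consults a static main index plus an
 * in-memory set of keys accepted during the current batch.
 */
final class BatchDuplicateFilter {
    // Hypothetical stand-in for a term lookup on the already-open main index.
    private final Predicate<String> existsInMainIndex;
    // Keys accepted in this batch that the main reader cannot see yet.
    private final Set<String> pendingKeys = new HashSet<>();

    BatchDuplicateFilter(Predicate<String> existsInMainIndex) {
        this.existsInMainIndex = existsInMainIndex;
    }

    // Assumption: two entries are duplicates if they are equal after
    // trimming, lower-casing, and collapsing runs of whitespace.
    static String normalize(String text) {
        return text.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    /** Returns true if the entry is new and should be added to the index. */
    boolean accept(String text) {
        String key = normalize(text);
        if (pendingKeys.contains(key) || existsInMainIndex.test(key)) {
            return false; // seen earlier in this batch, or already indexed
        }
        pendingKeys.add(key);
        return true;
    }

    /** Call once the batch is written and the main reader has been reopened. */
    void batchCommitted() {
        pendingKeys.clear();
    }

    int pendingCount() {
        return pendingKeys.size();
    }
}
```

The caller would add a document to the writer only when accept() returns
true, and call batchCommitted() after closing the writer and reopening the
reader, so the reader is reopened once per batch instead of once per document.
The set holds one short string per new entry, which for a few million short
documents should fit in a heap of the size you mention, though that is worth
measuring.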


On 6/28/07, Jens Grivolla <> wrote:
> Hi,
> I have a Lucene index with a few million entries, and I will need to
> add batches of a few hundred thousand or a few million additional
> entries.  Unfortunately, I absolutely need to have all indexed entries
> available when inserting a new one, even within one batch, in order to
> do some duplicate detection (using Lucene).
> I believe this means having to close the writer and reopen the reader
> to reflect the changes after each add.  I'm thinking of having the
> original index remain static and add the new entries to a separate
> index and merge the two later on.  I can even query them separately
> during the batch insertion if that gives me better performance.
> The questions:
> - How costly is merging two big indexes?
> - When / how often do I need to call optimize() on the new index?
> - Should I just keep MergeFactor at the default value?
> - If I have autoCommit=true, I can keep the writer open, but still need
>   to flush() and reopen readers to reflect the changes, right?
> - Is it better to have a MultiReader on both indexes or query them
>   separately so I don't have to reopen the old one every time?
> Additional info:
> Documents are very short, just a few words each.  I have 4 gigs of
> RAM in the machine, of which I could allocate quite a bit as heap for
> the writer if that helps.
> Thanks for any hints on how to best go about this,
> Jens
> P.S.: all my mails to the list get silently dropped when sending through
> GMail, possibly because the sender is not the same as the from: header
> (and only the from: actually contains the subscribed address).  This is
> very annoying and makes it impossible for me to write to the list in
> many situations.