lucene-java-user mailing list archives

From "Erick Erickson" <>
Subject Re: Periodic Indexing DESIGN QUESTION
Date Tue, 08 May 2007 23:29:56 GMT
Don't do it that way <G>? Is this an actual or theoretical
scenario? And do you reasonably expect it to become actual?
Otherwise, why bother?

And you've got other problems here. If you're indexing that
much data, you'll soon outgrow your disk, unless you're
replacing most of the documents.

But assuming that all this is somehow not a problem, I'd
consider something like indexing by directory. That is, for an
hour, collect all the incoming documents in directory d1. Then
turn an indexer process loose on d1 and start collecting docs in
d2. At the end of the next hour, start indexing d2 and collecting d3.

When each indexing process finishes, you can use
IndexWriter.addIndexes to merge its index into the main one. Or
you could batch them up and add all the indexes that have been
created in the last, say, 6 hours at once. You could even split
this across multiple machines if you get CPU bound.
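
To make the directory rotation concrete, here is a minimal sketch in plain Java. It assumes a hypothetical indexBatch callback that builds a Lucene index for one batch directory; the BatchRotator class and its method names are invented for illustration and are not part of Lucene. The point is only that collecting into the next directory never waits on the indexer that is still working through the previous one.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

/** Rotates batch directories (d1, d2, ...) so collection never waits on indexing. */
class BatchRotator {
    private final Path root;
    private final ExecutorService indexers;
    // Hypothetical callback: builds an index for one finished batch directory.
    private final Consumer<Path> indexBatch;
    private final List<Future<?>> running = new ArrayList<>();
    private int batchNo = 0;
    private Path current;

    BatchRotator(Path root, int workers, Consumer<Path> indexBatch) throws IOException {
        this.root = root;
        this.indexers = Executors.newFixedThreadPool(workers);
        this.indexBatch = indexBatch;
        this.current = newBatchDir();
    }

    private Path newBatchDir() throws IOException {
        batchNo++;
        return Files.createDirectories(root.resolve("d" + batchNo));
    }

    /** Directory that newly arriving documents should be written to. */
    synchronized Path currentDir() {
        return current;
    }

    /** Call once per period: hand the full directory to an indexer and start a fresh one. */
    synchronized Path rotate() throws IOException {
        Path full = current;
        current = newBatchDir();
        running.add(indexers.submit(() -> indexBatch.accept(full)));
        return full;
    }

    /** Blocks until every submitted batch is indexed; a merge step could follow this. */
    void awaitAll() throws Exception {
        List<Future<?>> snapshot;
        synchronized (this) {
            snapshot = new ArrayList<>(running);
            running.clear();
        }
        for (Future<?> f : snapshot) {
            f.get();
        }
    }

    void shutdown() {
        indexers.shutdown();
    }
}
```

A scheduler would call rotate() once an hour and awaitAll() before merging the per-batch indexes with IndexWriter.addIndexes, either after each batch or every few hours. With more than one worker, a batch that takes longer than an hour no longer blocks the next one, as long as the machine has CPU to spare.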

That said, I can't stress enough that you really need to consider
how long you can keep indexing data at that rate and have any
performance to speak of at search time.

If you're not indexing that much data, *and* you still have speed
problems, I'd look long and hard at my code to see why indexing
is taking so long. Are you closing/reopening the IndexWriter? Are
you optimizing too often? Is the way you access the data (perhaps
querying a database) painful?

Some real numbers would help. Things like:
How many documents are in your index?
How many arrive each hour?
How long does it take to index, say, 100 docs?
How big are the docs upon input? How much bigger do they make the index?

Have you measured *any* of these things? If so, please post
the numbers. Think about doing everything *except* indexing and see
if your bottleneck is somewhere unexpected.
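
One way to get those numbers is to time each stage of the pipeline separately. The sketch below is a small stdlib-only helper; StageTimer and the stage names are hypothetical, and it measures wall-clock time only, so treat the output as a rough guide rather than a benchmark.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Accumulates wall-clock time per named pipeline stage. */
class StageTimer {
    private final Map<String, Long> totalNanos = new LinkedHashMap<>();

    /** Runs the work and adds its elapsed time to the given stage's total. */
    void time(String stage, Runnable work) {
        long start = System.nanoTime();
        try {
            work.run();
        } finally {
            totalNanos.merge(stage, System.nanoTime() - start, Long::sum);
        }
    }

    /** Total nanoseconds recorded for a stage, or 0 if it never ran. */
    long nanosFor(String stage) {
        return totalNanos.getOrDefault(stage, 0L);
    }

    /** One line per stage, in the order the stages were first seen. */
    String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : totalNanos.entrySet()) {
            sb.append(String.format("%-8s %10.1f ms%n", e.getKey(), e.getValue() / 1e6));
        }
        return sb.toString();
    }
}
```

Wrap the database fetch, the document construction, and the call that adds the document to the index in separate time(...) calls under names such as "fetch", "build" and "index", run 100 docs through, and print report(). Replacing the "index" stage with a no-op gives the "everything except indexing" number directly.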

Anyway, hope this helps

On 5/8/07, Ram Peters <> wrote:
> I am indexing documents periodically every hour.  I have a scenario.
> For example, when you are indexing every hour and large document set
> is present, it takes >1 hr to index the documents.  Now you are
> already behind indexing for the next hour.  How do you design
> something that is robust?
> thanks.
