lucene-dev mailing list archives

From Leo Galambos <>
Subject Re: large index scalability and java 1.1 compatibility question
Date Tue, 20 Jan 2004 21:57:38 GMT
I'm sorry if you receive this e-mail twice. My ISP has problems with 
SMTP relay.

Mike Sawka wrote:

>We are currently running some multi-gigabyte indexes with over 10
>million documents, and the "optimize" time is starting to become a
>problem.  For our largest indexes we're already seeing times of 10-20
>minutes, on a fairly decent machine, which is starting to hit the
>threshold of acceptability for us (and will become unbearable as the
>index grows 2-10 times larger).  So I've got two questions:
>   * Are there any tricks that you guys use to run large (incrementally
>updatable) indexes?  I've already setup a mirroring system so I have one
>index that is always searchable while the other one is incrementally
>updating (and they swap periodically).

The optimize() routine is a bottleneck in Lucene. You have two options:
(a) do not call optimize() at all; or (b) modify your index significantly
(>75% of items) and only then call optimize(). Somebody may have further
advice, but there is a theoretical barrier that cannot be avoided.
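The ">75% of items" rule of thumb above can be wrapped in a small policy object. This is an illustrative sketch of my own (not a Lucene API): it tracks how much of the index has churned since the last optimize() and only signals a full optimize once the ratio crosses the threshold.

```java
// Illustrative sketch (assumption, not part of Lucene): decide when enough
// of the index has been modified to justify the expensive optimize() call.
public class OptimizePolicy {
    private final double threshold;       // e.g. 0.75 per the rule of thumb
    private long totalDocs;
    private long modifiedSinceOptimize;

    public OptimizePolicy(double threshold) { this.threshold = threshold; }

    public void recordAdd()    { totalDocs++; modifiedSinceOptimize++; }
    public void recordDelete() { totalDocs--; modifiedSinceOptimize++; }

    /** True once the churn ratio reaches the threshold. */
    public boolean shouldOptimize() {
        return totalDocs > 0
            && (double) modifiedSinceOptimize / totalDocs >= threshold;
    }

    /** Call after a real optimize() has run. */
    public void reset() { modifiedSinceOptimize = 0; }

    public static void main(String[] args) {
        OptimizePolicy p = new OptimizePolicy(0.75);
        for (int i = 0; i < 100; i++) p.recordAdd();
        p.reset();                                 // index just optimized
        for (int i = 0; i < 50; i++) p.recordAdd();
        System.out.println(p.shouldOptimize());    // 50/150 churn -> false
        for (int i = 0; i < 350; i++) p.recordAdd();
        System.out.println(p.shouldOptimize());    // 400/500 churn -> true
    }
}
```

The caller would check shouldOptimize() after each batch of updates and otherwise keep appending to the unoptimized index.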

I looked into this problem last year, and the method I developed for
another OSS search engine is presented here: The figure compares my
method with a build-from-scratch approach at merge factor 100. Lucene
(merge factor 100) appears to be slower than my method, by about 40% for
N=2^16 and about 15-20% for N=2^46, so add those values to the presented
numbers and you will see what Lucene does and when.

Using the figure, you can decide whether to rebuild your index from
scratch or to repair it with insert/removeDoc()/optimize(). If both
approaches fail, you should redesign your application.

Hope this helps.


PS: The figure is based on a simulation of my algorithm; the results for
N<2^26 have already been verified in a real system. The "number of
documents" axis is log_2 of the total number of documents in the index
(2^16...2^46); the "operations needed" axis sums I/O read and write
operations and compares them to the I/O of a rebuild-from-scratch.

N=total number of docs in the index
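To get a feel for why the comparison in the figure matters, here is a rough back-of-envelope model (my assumption, not the simulation behind the figure): with merge factor m, each document is rewritten roughly log_m(N) times as segments are merged, so incremental maintenance costs on the order of N*log_m(N) document I/O operations, while a rebuild-from-scratch writes each document only once, roughly N operations.

```java
// Back-of-envelope cost model (assumption, not the author's simulation):
// incremental indexing rewrites each of the N documents about log_m(N)
// times during segment merges; a rebuild writes each document once.
public class MergeCost {
    /** Approximate doc-I/O operations for incremental indexing. */
    static double incrementalOps(double n, double mergeFactor) {
        return n * (Math.log(n) / Math.log(mergeFactor));
    }

    /** Rebuild-from-scratch writes every document exactly once. */
    static double rebuildOps(double n) {
        return n;
    }

    public static void main(String[] args) {
        double m = 100;  // merge factor discussed in the post
        for (int exp : new int[] {16, 26, 46}) {
            double n = Math.pow(2, exp);
            double ratio = incrementalOps(n, m) / rebuildOps(n);
            System.out.printf("N=2^%d: incremental/rebuild ~ %.2fx%n",
                              exp, ratio);
        }
    }
}
```

Under this crude model the incremental-to-rebuild ratio is just log_m(N), which grows slowly with N; the real simulation in the figure accounts for much more, but the logarithmic shape is the same reason the gap widens from 2^16 to 2^46.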

