lucene-java-user mailing list archives

From Eric Jain <Eric.J...@isb-sib.ch>
Subject Re: Lucene Performance Issues
Date Tue, 28 Mar 2006 11:09:31 GMT
thomasg wrote:
> 1) By default, Lucene only indexes the first 10,000 words from each
> document. When increasing this default out-of-memory errors can occur. This
> implies that documents, or large sections thereof, are loaded into memory.
> ISYS has a very small memory footprint which is not affected by document
> size nor number of documents.

As far as I know, documents do indeed have to be built in memory prior to 
indexing. But this shouldn't be a problem unless you have only a few 
megabytes of memory, or you have documents that are hundreds of megabytes 
in size -- and such large documents should probably be split anyway.
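
One way to split an oversized document before indexing is sketched below. This is only an illustration under assumptions: the class name DocumentSplitter, the method split, and the choice of a fixed word count per chunk are all hypothetical and not part of Lucene. Each returned chunk would presumably be added to the index as its own document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: cuts a large text into chunks
// of at most maxWords whitespace-separated words.
final class DocumentSplitter {

    static List<String> split(String text, int maxWords) {
        if (maxWords <= 0) {
            throw new IllegalArgumentException("maxWords must be positive");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks; // nothing to index
        }
        StringBuilder current = new StringBuilder();
        int count = 0;
        for (String word : trimmed.split("\\s+")) {
            if (count == maxWords) {
                // Current chunk is full: emit it and start a new one.
                chunks.add(current.toString());
                current.setLength(0);
                count = 0;
            }
            if (count > 0) {
                current.append(' ');
            }
            current.append(word);
            count++;
        }
        chunks.add(current.toString()); // last, possibly shorter, chunk
        return chunks;
    }

    public static void main(String[] args) {
        // Prints [a b, c d, e]
        System.out.println(split("a b c d e", 2));
    }
}
```

Splitting on a fixed word count is the simplest choice; splitting on chapter or section boundaries would likely give more useful search hits, but depends on the document format.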


> 2) Lucene appears to be slow at indexing, at least by ISYS' standards.
> Published performance benchmarks seem to vary between almost acceptable,
> down to very poor. ISYS' file readers are already optimized for the fastest
> text extraction possible.

Indexing performance is my main concern with Lucene, though there are 
several parameters that can be tuned and I haven't exhausted all of them yet...

Currently I am using:

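   // Let up to 100 segments accumulate before they are merged.
   writer.setMergeFactor(100);
   // Buffer 100 documents in memory before flushing a new segment.
   writer.setMaxBufferedDocs(100);
   // Don't pack each segment into a single compound file.
   writer.setUseCompoundFile(false);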

This allows me to build a 3GB index with about 3M documents in 6h on a 
2x2GHz Intel Xeon machine with 1GB of memory and a reasonably fast hard 
disk. There is some other stuff going on besides the indexing, but the 
indexing does seem to take up the greatest amount of time.
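
To put those figures in perspective, here is a rough back-of-the-envelope calculation. It assumes the full 6 hours is spent indexing (which overstates it, given the other work going on) and that 1GB means 2^30 bytes; the class and method names are made up for illustration.

```java
// Rough throughput arithmetic for the numbers quoted above.
// Assumptions: all 6 hours count as indexing time, 1 GB = 2^30 bytes.
final class IndexingThroughput {

    static double docsPerSecond(long documents, double hours) {
        return documents / (hours * 3600.0);
    }

    static double bytesPerDocument(long indexBytes, long documents) {
        return (double) indexBytesPerDoc(indexBytes) / documents;
    }

    // Identity helper kept only to make the unit (bytes) explicit.
    private static long indexBytesPerDoc(long indexBytes) {
        return indexBytes;
    }

    public static void main(String[] args) {
        long documents = 3_000_000L;
        long indexBytes = 3L * 1024 * 1024 * 1024;
        double hours = 6.0;

        // About 139 documents per second.
        System.out.println(Math.round(docsPerSecond(documents, hours)) + " docs/s");
        // About 1074 bytes of index per document.
        System.out.println(Math.round(bytesPerDocument(indexBytes, documents)) + " bytes/doc");
    }
}
```

So this works out to roughly 139 documents per second and about 1KB of index per document. Note that the second figure is index size, not the size of the source documents.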

Note that Lucene also supports incremental updates.


> 3) The Lucene documentation suggests it can be slow at searching and can get
> slower and slower the larger your indexes get. The tipping point is where
> the index size exceeds the amount of free memory in your machine. This also
> implies that whole indexes, or large portions of them, are loaded into
> memory. The bigger the index, the more powerful the machine required. ISYS'
> search speed is always proportional to the size of the result set. Index
> size does not materially affect search speed and the index is never loaded
> into memory. It also appears that Lucene requires hands-on tuning to keep
> its search speed acceptable. ISYS' indexes are self-managing and do not
> require any maintenance to keep them searchable at full speed.

Queries on the index mentioned above return results within a few 
milliseconds, with less than 256MB used by the VM, though some complex 
queries that contain a lot of frequent terms may take up to several 
seconds. I'm not sure how Lucene's searching performance can be tuned, but 
I haven't bothered to do so, as it hasn't been a bottleneck so far...

