Armbrust, Daniel C. wrote:
> While I was trying to build this index, the biggest limitation of
> Lucene that I ran into was optimization. Optimization kills the
> indexers performance when you get between 3-5 million documents in an
> index. On my Windows XP box, I had to reoptimize every 100,000
> documents to keep from running out of file handles.
What is the file handle limit on XP?
When batch indexing, optimizing before the end slows things down, and
should not be required.
Are you otherwise opening index readers in the same process? Index
readers use a lot more file handles than the index writer, since they
must keep all files in all segments open. For large indexes it's best
to do your indexing in a separate process which never opens an IndexReader.
The max a reader will keep open is:
mergeFactor * log_base_mergeFactor(N) * files_per_segment
With mergeFactor=10 (the default) and 1 million documents, and 10 files
per segment, a reader on a never-optimized index should at most require
600 open files, and typically half that.
A writer will open:
(1 + mergeFactor) * files_per_segment
With mergeFactor=10 (the default) and 1 million documents, a writer on a
never-optimized index would require 110 open files.
I just built a 3M document index on Linux in five hours, with no
intermediate optimizations. I set the mergeFactor to 50. This required
around 500 file handles, well beneath the 1024 limit.
Doug
--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
|