From Michael McCandless <>
Subject Re: Lucene scalability observations with a large volatile Index
Date Mon, 29 Mar 2010 11:17:55 GMT
On #1: Unfortunately, you cannot control the terms index divisor that
IW uses when opening its internal readers.

Long term we need to factor out the reader pool that IW uses... so
that an app can provide its own impl that could control this (and
other) settings.  There's already work being done to do some of this
refactoring... but I'll open an issue specifically to make sure we can
control the terms index divisor in particular, in case the refactoring
doesn't resolve this by 3.1.  OK I opened

But there is a possible workaround, in 2.9.x, which may or may not
work for you: call IndexWriter.getReader(int termInfosIndexDivisor).
This returns an NRT reader, which you can close immediately if you
don't need it, but the call causes IW to pool its readers, and those
readers first opened via getReader will have the right terms index
divisor set.  You could call this immediately after opening a new writer.
This isn't a perfect workaround, though, since newly merged segments
may still first be loaded when applying deletes...
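
For a rough sense of what the divisor buys, here is a back-of-the-envelope
sketch (not Lucene code).  It assumes the default termIndexInterval of 128,
takes the upper end of the 600 million unique terms quoted below, and ignores
per-segment rounding.  The class and method names are made up for
illustration.

```java
/** Rough count of terms-index entries a reader holds in RAM (illustrative only). */
final class TermsIndexEstimate {

    /**
     * @param uniqueTerms   total unique terms in the index
     * @param indexInterval termIndexInterval used at write time (assumed 128 here)
     * @param indexDivisor  terms index divisor applied at read time (1 = load every entry)
     */
    static long loadedIndexEntries(long uniqueTerms, int indexInterval, int indexDivisor) {
        // Every indexInterval-th term is written to the terms index...
        long written = ceilDiv(uniqueTerms, indexInterval);
        // ...and only every indexDivisor-th of those is loaded by the reader.
        return ceilDiv(written, indexDivisor);
    }

    static long ceilDiv(long a, long b) {
        return (a + b - 1) / b;
    }

    public static void main(String[] args) {
        long terms = 600_000_000L;
        for (int divisor : new int[] {1, 4, 16}) {
            System.out.println("divisor=" + divisor + " -> "
                    + loadedIndexEntries(terms, 128, divisor) + " index entries in RAM");
        }
    }
}
```

A reader that IW opens internally behaves like the divisor=1 row, which is
why it can hit the heap limit even when the app's own readers do not.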

Hmm on #2, LUCENE-1717 was supposed to address properly accounting for
RAM usage of buffered deletions.  Are you sure the OOME was due purely
to IW using too much RAM?  How many terms had you added since the last
flush?  (You can turn on infoStream in IW to see flushes).  It could
be we are undercounting bytes used per deleted term...  One possible
workaround is to use IW.setMaxBufferedDeleteTerms, i.e., flush by count
instead of RAM usage.
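
To illustrate why a count trigger sidesteps any byte undercounting, here is a
toy model of a count-bounded delete buffer.  It is only a sketch: it is not
how DocumentsWriter is implemented, and the class, its fields, and the
String-typed terms are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy delete buffer that flushes on term count, independent of any RAM estimate. */
final class CountBoundedDeleteBuffer {
    private final int maxBufferedTerms;
    private final List<String> pending = new ArrayList<>();
    private int flushes = 0;

    CountBoundedDeleteBuffer(int maxBufferedTerms) {
        if (maxBufferedTerms < 1) {
            throw new IllegalArgumentException("maxBufferedTerms must be >= 1");
        }
        this.maxBufferedTerms = maxBufferedTerms;
    }

    /** Buffers one delete term and flushes as soon as the count limit is reached. */
    void bufferDelete(String term) {
        pending.add(term);
        if (pending.size() >= maxBufferedTerms) {
            flush();
        }
    }

    /** A real writer would apply the buffered deletes here; the sketch just drops them. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        pending.clear();
        flushes++;
    }

    int pendingCount() { return pending.size(); }

    int flushCount() { return flushes; }
}
```

With a limit of 4, buffering 10 terms triggers two flushes and leaves two
terms pending, no matter how many bytes each term really occupies.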

On #3, Lucene needs this int[] to remap docIDs when compacting
deletions.  Maybe set the maxMergeMB so that big segments are not
merged?  This'd mean you'd never have a fully optimized index...

We could consider using packed ints here... and perhaps instead of
storing docID, store the cumulative delete count, which typically
would be a smaller number... I'll open an issue for this.
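
As a sketch of that idea (assumptions only, not a patch): the dense map below
stores the new docID for every old docID, with -1 for deleted docs, while the
alternative stores how many deletes precede each doc and derives the new
docID by subtraction.  The counts are kept in a plain int[] for clarity; the
point is that their values are bounded by the delete count, so a packed
representation would need fewer bits per entry.  All names here are
hypothetical.

```java
import java.util.BitSet;

/** Two ways to remap docIDs when deleted docs are compacted away (illustrative only). */
final class DocIdRemap {

    /** Dense map: one int per doc, holding the new docID or -1 if the doc is deleted. */
    static int[] denseDocMap(BitSet deleted, int maxDoc) {
        int[] map = new int[maxDoc];
        int next = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            map[doc] = deleted.get(doc) ? -1 : next++;
        }
        return map;
    }

    /** Cumulative delete count: entry [doc] is the number of deleted docs with a smaller ID. */
    static int[] deletesBefore(BitSet deleted, int maxDoc) {
        int[] counts = new int[maxDoc];
        int seen = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            counts[doc] = seen;
            if (deleted.get(doc)) {
                seen++;
            }
        }
        return counts;
    }

    /** New docID derived from the cumulative count instead of a stored docID. */
    static int remap(BitSet deleted, int[] deletesBefore, int doc) {
        return deleted.get(doc) ? -1 : doc - deletesBefore[doc];
    }

    /** Heap needed by a dense int[] map: 4 bytes per document. */
    static long denseMapBytes(long maxDoc) {
        return maxDoc * 4L;
    }

    public static void main(String[] args) {
        // 200 million docs * 4 bytes = 800,000,000 bytes, the ~800MB reported below.
        System.out.println(denseMapBytes(200_000_000L) + " bytes for the dense map");
    }
}
```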

Probably, also, you should switch to a 64 bit JRE :)


On Mon, Mar 29, 2010 at 6:57 AM, ajjb 936 <> wrote:
> Hi,
> I have some observations from using Lucene with my particular use case that
> I thought might be useful to capture.
> I need to create and continuously update a Lucene Index where each document
> adds 2 to 3 unique terms. The index holds between 150 and 200 million
> documents and around 300 to 600 million unique terms. I am running on 32-bit
> Windows with Lucene versions 2.4 and 2.9.2.
> 1)  To reduce memory usage when performing a TermEnum walk of the entire
> Index I use an appropriate value in the method setTermInfosIndexDivisor( int
> indexDivisor) on the IndexReader. (I have chosen not to use the
> setTermIndexInterval(int interval) on the IndexWriter to allow fast random
> access). A problem occurs when I try to delete a number of documents from
> the Index. The IndexWriter internally creates an IndexReader on which I am
> unable to control the indexDivisor value; this results in an
> OutOfMemoryError in low memory situations.
> java.lang.OutOfMemoryError: Java heap space at
> org.apache.lucene.index.SegmentTermEnum.termInfo(
>        at
> org.apache.lucene.index.TermInfosReader.ensureIndexIsRead(
>        at
> org.apache.lucene.index.TermInfosReader.get(
>        at
> org.apache.lucene.index.TermInfosReader.get(
>        at
>        at
> org.apache.lucene.index.IndexReader.termDocs(
>        at
> org.apache.lucene.index.DocumentsWriter.applyDeletes(
>        at
> org.apache.lucene.index.DocumentsWriter.applyDeletes(
>        at
> org.apache.lucene.index.IndexWriter.applyDeletes(
>        at
> org.apache.lucene.index.IndexWriter.doFlush(
>        at org.apache.lucene.index.IndexWriter.flush(
>        at
> org.apache.lucene.index.IndexWriter.closeInternal(
>        at org.apache.lucene.index.IndexWriter.close(
>        at org.apache.lucene.index.IndexWriter.close(
> A solution is to set an appropriate value on the IndexWriter
> setTermIndexInterval(int interval), at the cost of search speed.
> Is there a way to control the IndexDivisor value on any readers created by
> an IndexWriter? If not, it may be useful to have this ability.
> 2) When trying to delete large numbers of documents from the index, using an
> IndexWriter, it appears that using the method setRAMBufferSizeMB() has no
> effect. I consistently run out of memory when trying to delete a third of
> all documents in my index (stack trace below). I realised that even if the
> RAMBufferSize were used, the IndexWriter would have to perform a full
> TermEnum walk of the Index every time the RAM Buffer was full, which would
> really slow the deletion process down (in addition, I would face the problem
> mentioned above).
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>            at
> org.apache.lucene.index.DocumentsWriter.addDeleteTerm(
>            at
> org.apache.lucene.index.DocumentsWriter.bufferDeleteTerm(
>            at
> org.apache.lucene.index.IndexWriter.deleteDocuments(
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> at org.apache.lucene.index.TermBuffer.toTerm(
>  at org.apache.lucene.index.SegmentTermEnum.term(
> at
>  at
> org.apache.lucene.index.MultiSegmentReader$
> As a workaround, I am using an IndexReader to perform the deletes as it is
> far more memory efficient.
> Another solution may be to call commit on the IndexWriter more often (i.e.
> perform the deletes as smaller transactions).
> 3) In some scenarios, we have chosen to postpone an optimize, and to use the
> method expungeDeletes() on IndexWriter. We face another memory issue here in
> that Lucene creates an int[] with the size of indexReader.maxDoc(). With
> 200 million docs, the initialisation of this array causes an OutOfMemoryError
> in low memory situations; the array alone uses about 800MB of memory.
> Caused by: java.lang.OutOfMemoryError: Java heap space
>        at
> org.apache.lucene.index.SegmentMergeInfo.getDocMap(
>        at
> org.apache.lucene.index.SegmentMerger.mergeTermInfos(
>        at
> org.apache.lucene.index.SegmentMerger.mergeTerms(
>        at
> org.apache.lucene.index.SegmentMerger.merge(
>        at
> org.apache.lucene.index.IndexWriter.mergeMiddle(
>        at org.apache.lucene.index.IndexWriter.merge(
>        at
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(
> I do not have a workaround for this issue, and it is preventing us from
> running on a 32-bit OS. Any advice on this issue would be appreciated.
> Cheers,
> Alistair

