From: Michael McCandless
To: java-user@lucene.apache.org
Subject: Re: Lucene scalability observations with a large volatile Index
Date: Mon, 29 Mar 2010 07:21:41 -0400
Message-ID: <9ac0c6aa1003290421o17778255n9d31e9227c0858d0@mail.gmail.com>
In-Reply-To: <9ac0c6aa1003290417h540294a0y6127255af07e18bf@mail.gmail.com>
Reply-To: java-user@lucene.apache.org
OK I opened https://issues.apache.org/jira/browse/LUCENE-2357 for #3.

Mike

On Mon, Mar 29, 2010 at 7:17 AM, Michael McCandless wrote:
> On #1: Unfortunately, you cannot control the terms index divisor that
> IW uses when opening its internal readers.
>
> Long term we need to factor out the reader pool that IW uses, so
> that an app can provide its own impl that could control this (and
> other) settings. There's already work being done on some of this
> refactoring, but I'll open an issue specifically to make sure we can
> control the terms index divisor in particular, in case the refactoring
> doesn't resolve this by 3.1. OK I opened
> https://issues.apache.org/jira/browse/LUCENE-2356.
>
> But there is a possible workaround, in 2.9.x, which may or may not
> work for you: call IndexWriter.getReader(int termInfosIndexDivisor).
> This returns an NRT reader which you can immediately close if you
> don't need to use it, but it causes IW to pool the readers, and those
> readers first opened via getReader will have the right terms index
> divisor set. You could call this immediately on opening a new writer.
> This isn't a perfect workaround, though, since newly merged segments
> may still first be loaded when applying deletes...
>
> Hmm, on #2, LUCENE-1717 was supposed to address properly accounting for
> the RAM usage of buffered deletions. Are you sure the OOME was due purely
> to IW using too much RAM? How many terms had you added since the last
> flush? (You can turn on infoStream in IW to see flushes.) It could
> be we are undercounting bytes used per deleted term... One possible
> workaround is to use IW.setMaxBufferedDeleteTerms, i.e. flush by count
> instead of by RAM usage.
>
> On #3, Lucene needs this int[] to remap docIDs when compacting
> deletions. Maybe set maxMergeMB so that big segments are not
> merged? This'd mean you'd never have a fully optimized index...
>
> We could consider using packed ints here... and perhaps instead of
> storing the docID, store the cumulative delete count, which typically
> would be a smaller number. I'll open an issue for this.
>
> Probably, also, you should switch to a 64 bit JRE :)
>
> Mike
>
> On Mon, Mar 29, 2010 at 6:57 AM, ajjb 936 wrote:
>> Hi,
>>
>> I have some observations from using Lucene with my particular use case; I
>> thought it might be useful to capture some of them.
>>
>> I need to create and continuously update a Lucene index where each document
>> adds 2 to 3 unique terms. The number of documents in the index is between
>> 150 and 200 million, and the number of unique terms in the index is around
>> 300 to 600 million. I am running on 32-bit Windows, with Lucene versions
>> 2.4 and 2.9.2.
>>
>> 1) To reduce memory usage when performing a TermEnum walk of the entire
>> index, I use an appropriate value in setTermInfosIndexDivisor(int
>> indexDivisor) on the IndexReader. (I have chosen not to use
>> setTermIndexInterval(int interval) on the IndexWriter, to allow fast random
>> access.) A problem occurs when I try to delete a number of documents from
>> the index: the IndexWriter internally creates an IndexReader on which I am
>> unable to control the indexDivisor value, and this results in an
>> OutOfMemoryError in low-memory situations.
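[To put rough numbers on why the divisor matters at this scale, here is a plain-Java back-of-the-envelope sketch. The term index interval of 128 is Lucene's documented default; the per-entry byte cost is a made-up ballpark for illustration, not a measured figure, and the `getReader` call shown in comments is the 2.9.x workaround from Mike's reply, not compiled here.]

```java
public class TermsIndexDivisor {
    // One in-heap index entry is kept for every
    // (TERM_INDEX_INTERVAL * divisor)-th term in the segment.
    static final int TERM_INDEX_INTERVAL = 128; // Lucene's default
    static final long BYTES_PER_ENTRY = 48;     // assumed ballpark per Term+TermInfo

    // Rough RAM estimate for the loaded terms index.
    static long estimateBytes(long numTerms, int divisor) {
        return numTerms / TERM_INDEX_INTERVAL / divisor * BYTES_PER_ENTRY;
    }

    public static void main(String[] args) {
        long numTerms = 600000000L; // upper end of the reported index size
        System.out.println(estimateBytes(numTerms, 1)); // 225000000 (~225MB)
        System.out.println(estimateBytes(numTerms, 4)); // 56250000  (~56MB)

        // The 2.9.x workaround from the reply above (requires Lucene,
        // shown as comments only):
        //   IndexReader r = writer.getReader(4); // seeds IW's reader pool
        //   r.close(); // pooled readers keep the divisor; newly merged
        //              // segments may still load with the default first
    }
}
```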
>>
>> java.lang.OutOfMemoryError: Java heap space
>>         at org.apache.lucene.index.SegmentTermEnum.termInfo(SegmentTermEnum.java:178)
>>         at org.apache.lucene.index.TermInfosReader.ensureIndexIsRead(TermInfosReader.java:179)
>>         at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:225)
>>         at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:218)
>>         at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:55)
>>         at org.apache.lucene.index.IndexReader.termDocs(IndexReader.java:780)
>>         at org.apache.lucene.index.DocumentsWriter.applyDeletes(DocumentsWriter.java:952)
>>         at org.apache.lucene.index.DocumentsWriter.applyDeletes(DocumentsWriter.java:918)
>>         at org.apache.lucene.index.IndexWriter.applyDeletes(IndexWriter.java:4336)
>>         at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3572)
>>         at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3442)
>>         at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:1623)
>>         at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1588)
>>         at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1562)
>>
>> A solution is to set an appropriate value via setTermIndexInterval(int
>> interval) on the IndexWriter, at the cost of search speed.
>>
>> Is there a way to control the indexDivisor value on any readers created by
>> an IndexWriter? If not, it may be useful to have this ability.
>>
>>
>> 2) When trying to delete large numbers of documents from the index using an
>> IndexWriter, it appears that setRAMBufferSizeMB() has no
>> effect. I consistently run out of memory when trying to delete a third of
>> all documents in my index (stack trace below).
>> I realised that even if the
>> RAMBufferSize were respected, the IndexWriter would have to perform a full
>> TermEnum walk of the index every time the RAM buffer was full, which would
>> really slow the deletion process down (in addition, I would face the problem
>> mentioned above).
>>
>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>         at org.apache.lucene.index.DocumentsWriter.addDeleteTerm(DocumentsWriter.java:1008)
>>         at org.apache.lucene.index.DocumentsWriter.bufferDeleteTerm(DocumentsWriter.java:861)
>>         at org.apache.lucene.index.IndexWriter.deleteDocuments(IndexWriter.java:1938)
>>
>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>         at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122)
>>         at org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:167)
>>         at org.apache.lucene.index.SegmentMergeInfo.next(SegmentMergeInfo.java:66)
>>         at org.apache.lucene.index.MultiSegmentReader$MultiTermEnum.next(MultiSegmentReader.java:495)
>>
>> As a workaround, I am using an IndexReader to perform the deletes, as it is
>> far more memory efficient.
>>
>> Another solution may be to call commit on the IndexWriter more often (i.e.
>> perform the deletes as smaller transactions).
>>
>> 3) In some scenarios, we have chosen to postpone an optimize and to use
>> expungeDeletes() on IndexWriter. We face another memory issue here, in
>> that Lucene creates an int[] with the size of indexReader.maxDoc(). With
>> 200 million docs, the initialisation of this array causes an
>> OutOfMemoryError in low-memory situations; the array alone uses up
>> about 800MB of memory.
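[The 800MB figure checks out, and the "cumulative delete count" idea from Mike's reply can be sketched in plain Java: rather than materialising a per-doc int[] docMap, the remapped docID can be derived from a binary search over the sorted deleted docIDs. The deleted-docID values below are made-up examples.]

```java
public class DocMapMath {
    // Number of deleted docs strictly before 'docID', given a sorted
    // array of deleted docIDs (a lower-bound binary search). This is the
    // "cumulative delete count" in the reply: newID = oldID - deletesBefore(oldID).
    static int deletesBefore(int[] sortedDeletes, int docID) {
        int lo = 0, hi = sortedDeletes.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sortedDeletes[mid] < docID) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    public static void main(String[] args) {
        // The int[maxDoc] docMap at 200M docs: 4 bytes per entry.
        long bytes = 200000000L * 4;
        System.out.println(bytes);   // 800000000 -- matching the ~800MB report

        // Remapping after compaction, with example deleted docIDs:
        int[] deleted = {2, 5, 6};   // sorted
        int oldID = 7;
        int newID = oldID - deletesBefore(deleted, oldID);
        System.out.println(newID);   // 4: doc 7 slides down past 3 deletions
    }
}
```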
>>
>> Caused by: java.lang.OutOfMemoryError: Java heap space
>>         at org.apache.lucene.index.SegmentMergeInfo.getDocMap(SegmentMergeInfo.java:44)
>>         at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:517)
>>         at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:500)
>>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:140)
>>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4226)
>>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3877)
>>         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:205)
>>
>> I do not have a workaround for this issue, and it is preventing us from
>> running on a 32-bit OS. Any advice on this issue would be appreciated.
>>
>> Cheers,
>>
>> Alistair
>>
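[For #2, Alistair's "smaller transactions" workaround amounts to bounding how many deletions are buffered before each flush. A minimal sketch of the batching part in plain Java; the Lucene calls (deleteDocuments/commit) are shown only as comments, since whether this relieves the OOME depends on how 2.9.x accounts for buffered delete terms.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BatchedDeletes {
    // Split the terms-to-delete into fixed-size batches. The caller would
    // delete one batch at a time and commit between batches, so at most
    // batchSize delete terms are ever buffered in the IndexWriter.
    static List<List<String>> batches(List<String> terms, int batchSize) {
        List<List<String>> out = new ArrayList<List<String>>();
        for (int i = 0; i < terms.size(); i += batchSize) {
            out.add(terms.subList(i, Math.min(i + batchSize, terms.size())));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("a", "b", "c", "d", "e");
        List<List<String>> b = batches(ids, 2);
        System.out.println(b.size()); // 3 batches: [a, b], [c, d], [e]

        // Per batch (hypothetical Lucene calls, not compiled here):
        //   for (String id : batch) writer.deleteDocuments(new Term("id", id));
        //   writer.commit(); // flush buffered deletes before the next batch
    }
}
```

The trade-off is that each commit triggers an apply-deletes pass, so smaller batches mean more TermEnum walks; batch size is a knob between peak RAM and total deletion time.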