From: Michael McCandless
To: java-user@lucene.apache.org
Subject: Re: Lucene scalability observations with a large volatile Index
Date: Mon, 29 Mar 2010 07:21:41 -0400
Message-ID: <9ac0c6aa1003290421o17778255n9d31e9227c0858d0@mail.gmail.com>
In-Reply-To: <9ac0c6aa1003290417h540294a0y6127255af07e18bf@mail.gmail.com>
Reply-To: java-user@lucene.apache.org
OK I opened https://issues.apache.org/jira/browse/LUCENE-2357 for #3.

Mike

On Mon, Mar 29, 2010 at 7:17 AM, Michael McCandless wrote:
> On #1: Unfortunately, you cannot control the terms index divisor that
> IW uses when opening its internal readers.
>
> Long term we need to factor out the reader pool that IW uses, so
> that an app can provide its own impl that could control this (and
> other) settings. There's already work being done on some of this
> refactoring, but I'll open an issue specifically to make sure we can
> control the terms index divisor in particular, in case the refactoring
> doesn't resolve this by 3.1. OK I opened
> https://issues.apache.org/jira/browse/LUCENE-2356.
>
> But there is a possible workaround, in 2.9.x, which may or may not
> work for you: call IndexWriter.getReader(int termInfosIndexDivisor).
> This returns an NRT reader which you can immediately close if you
> don't need to use it, but it causes IW to pool the readers, and those
> readers first opened via getReader will have the right terms index
> divisor set. You could call this immediately on opening a new writer.
> This isn't a perfect workaround, though, since newly merged segments
> may still first be loaded when applying deletes...
>
> Hmm, on #2, LUCENE-1717 was supposed to address properly accounting for
> the RAM usage of buffered deletions. Are you sure the OOME was due purely
> to IW using too much RAM? How many terms had you added since the last
> flush? (You can turn on infoStream in IW to see flushes.) It could
> be we are undercounting bytes used per deleted term... One possible
> workaround is to use IW.setMaxBufferedDeleteTerms, i.e. flush by count
> instead of by RAM usage.
>
> On #3, Lucene needs this int[] to remap docIDs when compacting
> deletions. Maybe set maxMergeMB so that big segments are not
> merged? This'd mean you'd never have a fully optimized index...
>
> We could consider using packed ints here... and perhaps instead of
> storing the docID, store the cumulative delete count, which typically
> would be a smaller number. I'll open an issue for this.
>
> Probably, also, you should switch to a 64 bit JRE :)
>
> Mike
>
> On Mon, Mar 29, 2010 at 6:57 AM, ajjb 936 wrote:
>> Hi,
>>
>> I have some observations from using Lucene with my particular use case; I
>> thought it might be useful to capture some of them.
>>
>> I need to create and continuously update a Lucene index where each document
>> adds 2 to 3 unique terms. The number of documents in the index is between
>> 150 and 200 million, and the number of unique terms in the index is around
>> 300 to 600 million. I am running on 32-bit Windows, with Lucene versions
>> 2.4 and 2.9.2.
>>
>> 1) To reduce memory usage when performing a TermEnum walk of the entire
>> index, I use an appropriate value in setTermInfosIndexDivisor(int
>> indexDivisor) on the IndexReader. (I have chosen not to use
>> setTermIndexInterval(int interval) on the IndexWriter, to allow fast random
>> access.) A problem occurs when I try to delete a number of documents from
>> the index: the IndexWriter internally creates an IndexReader on which I am
>> unable to control the indexDivisor value, and this results in an
>> OutOfMemoryError in low-memory situations.
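[To put rough numbers on why the divisor matters at this scale, here is a plain-Java back-of-the-envelope sketch. The term index interval of 128 is Lucene's documented default; the per-entry byte cost is a made-up ballpark for illustration, not a measured figure, and the `getReader` call shown in comments is the 2.9.x workaround from Mike's reply, not compiled here.]

```java
public class TermsIndexDivisor {
    // One in-heap index entry is kept for every
    // (TERM_INDEX_INTERVAL * divisor)-th term in the segment.
    static final int TERM_INDEX_INTERVAL = 128; // Lucene's default
    static final long BYTES_PER_ENTRY = 48;     // assumed ballpark per Term+TermInfo

    // Rough RAM estimate for the loaded terms index.
    static long estimateBytes(long numTerms, int divisor) {
        return numTerms / TERM_INDEX_INTERVAL / divisor * BYTES_PER_ENTRY;
    }

    public static void main(String[] args) {
        long numTerms = 600000000L; // upper end of the reported index size
        System.out.println(estimateBytes(numTerms, 1)); // 225000000 (~225MB)
        System.out.println(estimateBytes(numTerms, 4)); // 56250000  (~56MB)

        // The 2.9.x workaround from the reply above (requires Lucene,
        // shown as comments only):
        //   IndexReader r = writer.getReader(4); // seeds IW's reader pool
        //   r.close(); // pooled readers keep the divisor; newly merged
        //              // segments may still load with the default first
    }
}
```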
>>
>> java.lang.OutOfMemoryError: Java heap space
>>         at org.apache.lucene.index.SegmentTermEnum.termInfo(SegmentTermEnum.java:178)
>>         at org.apache.lucene.index.TermInfosReader.ensureIndexIsRead(TermInfosReader.java:179)
>>         at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:225)
>>         at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:218)
>>         at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:55)
>>         at org.apache.lucene.index.IndexReader.termDocs(IndexReader.java:780)
>>         at org.apache.lucene.index.DocumentsWriter.applyDeletes(DocumentsWriter.java:952)
>>         at org.apache.lucene.index.DocumentsWriter.applyDeletes(DocumentsWriter.java:918)
>>         at org.apache.lucene.index.IndexWriter.applyDeletes(IndexWriter.java:4336)
>>         at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3572)
>>         at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3442)
>>         at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:1623)
>>         at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1588)
>>         at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1562)
>>
>> A solution is to set an appropriate value via setTermIndexInterval(int
>> interval) on the IndexWriter, at the cost of search speed.
>>
>> Is there a way to control the indexDivisor value on any readers created by
>> an IndexWriter? If not, it may be useful to have this ability.
>>
>>
>> 2) When trying to delete large numbers of documents from the index using an
>> IndexWriter, it appears that setRAMBufferSizeMB() has no
>> effect. I consistently run out of memory when trying to delete a third of
>> all documents in my index (stack trace below).
>> I realised that even if the
>> RAMBufferSize were respected, the IndexWriter would have to perform a full
>> TermEnum walk of the index every time the RAM buffer was full, which would
>> really slow the deletion process down (in addition, I would face the problem
>> mentioned above).
>>
>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>         at org.apache.lucene.index.DocumentsWriter.addDeleteTerm(DocumentsWriter.java:1008)
>>         at org.apache.lucene.index.DocumentsWriter.bufferDeleteTerm(DocumentsWriter.java:861)
>>         at org.apache.lucene.index.IndexWriter.deleteDocuments(IndexWriter.java:1938)
>>
>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>         at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122)
>>         at org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:167)
>>         at org.apache.lucene.index.SegmentMergeInfo.next(SegmentMergeInfo.java:66)
>>         at org.apache.lucene.index.MultiSegmentReader$MultiTermEnum.next(MultiSegmentReader.java:495)
>>
>> As a workaround, I am using an IndexReader to perform the deletes, as it is
>> far more memory efficient.
>>
>> Another solution may be to call commit on the IndexWriter more often (i.e.
>> perform the deletes as smaller transactions).
>>
>> 3) In some scenarios, we have chosen to postpone an optimize and to use
>> expungeDeletes() on IndexWriter. We face another memory issue here, in
>> that Lucene creates an int[] with the size of indexReader.maxDoc(). With
>> 200 million docs, the initialisation of this array causes an
>> OutOfMemoryError in low-memory situations; the array alone uses up
>> about 800MB of memory.
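[The 800MB figure checks out, and the "cumulative delete count" idea from Mike's reply can be sketched in plain Java: rather than materialising a per-doc int[] docMap, the remapped docID can be derived from a binary search over the sorted deleted docIDs. The deleted-docID values below are made-up examples.]

```java
public class DocMapMath {
    // Number of deleted docs strictly before 'docID', given a sorted
    // array of deleted docIDs (a lower-bound binary search). This is the
    // "cumulative delete count" in the reply: newID = oldID - deletesBefore(oldID).
    static int deletesBefore(int[] sortedDeletes, int docID) {
        int lo = 0, hi = sortedDeletes.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sortedDeletes[mid] < docID) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    public static void main(String[] args) {
        // The int[maxDoc] docMap at 200M docs: 4 bytes per entry.
        long bytes = 200000000L * 4;
        System.out.println(bytes);   // 800000000 -- matching the ~800MB report

        // Remapping after compaction, with example deleted docIDs:
        int[] deleted = {2, 5, 6};   // sorted
        int oldID = 7;
        int newID = oldID - deletesBefore(deleted, oldID);
        System.out.println(newID);   // 4: doc 7 slides down past 3 deletions
    }
}
```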
>>
>> Caused by: java.lang.OutOfMemoryError: Java heap space
>>         at org.apache.lucene.index.SegmentMergeInfo.getDocMap(SegmentMergeInfo.java:44)
>>         at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:517)
>>         at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:500)
>>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:140)
>>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4226)
>>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3877)
>>         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:205)
>>
>> I do not have a workaround for this issue, and it is preventing us from
>> running on a 32-bit OS. Any advice on this issue would be appreciated.
>>
>> Cheers,
>>
>> Alistair
>>
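[For #2, Alistair's "smaller transactions" workaround amounts to bounding how many deletions are buffered before each flush. A minimal sketch of the batching part in plain Java; the Lucene calls (deleteDocuments/commit) are shown only as comments, since whether this relieves the OOME depends on how 2.9.x accounts for buffered delete terms.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BatchedDeletes {
    // Split the terms-to-delete into fixed-size batches. The caller would
    // delete one batch at a time and commit between batches, so at most
    // batchSize delete terms are ever buffered in the IndexWriter.
    static List<List<String>> batches(List<String> terms, int batchSize) {
        List<List<String>> out = new ArrayList<List<String>>();
        for (int i = 0; i < terms.size(); i += batchSize) {
            out.add(terms.subList(i, Math.min(i + batchSize, terms.size())));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("a", "b", "c", "d", "e");
        List<List<String>> b = batches(ids, 2);
        System.out.println(b.size()); // 3 batches: [a, b], [c, d], [e]

        // Per batch (hypothetical Lucene calls, not compiled here):
        //   for (String id : batch) writer.deleteDocuments(new Term("id", id));
        //   writer.commit(); // flush buffered deletes before the next batch
    }
}
```

The trade-off is that each commit triggers an apply-deletes pass, so smaller batches mean more TermEnum walks; batch size is a knob between peak RAM and total deletion time.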