Subject: Re: IndexWriter and memory usage
From: Michael McCandless
To: java-user@lucene.apache.org
Date: Thu, 1 Apr 2010 19:21:17 -0400

Hmm, not good. Can you post a heap dump?
Also, can you turn on infoStream, index up to the OOM @ 512 MB, and post the output?

IndexWriter should not hang onto much beyond the RAM buffer. But it does allocate and then recycle this RAM buffer, so even in an idle state (having indexed enough docs to fill up the RAM buffer at least once) it'll hold onto those 16 MB.

Are you using getReader (to get your NRT readers)? If so, are you really sure you're eventually closing the previous reader after opening a new one?

Mike

On Thu, Apr 1, 2010 at 6:58 PM, Woolf, Ross wrote:
> We are seeing a situation where the IndexWriter is using up the Java heap space and only releases memory for garbage collection upon a commit. We are using the default RAMBufferSize of 16 MB. We are using Lucene 2.9.1. We are set at a heap size of 512 MB.
>
> We have a large number of documents that are run through Tika and then added to the index. The data from Tika is changed to a string, and then sent to Lucene. Heap dumps clearly show the data in the Lucene classes and not in Tika. Our intent is to only perform a commit once the entire indexing run is complete, but several hours into the process everything comes to a crawl. Using both JConsole and VisualVM we can see that the heap space is maxed out and garbage collection is not able to clean up any memory once we get into this state. It is our understanding that the IndexWriter should only be holding onto 16 MB of data before it flushes it, but what we are seeing is that while it is in fact writing data to disk when it hits the 16 MB limit, it is also holding onto some data in memory and not allowing garbage collection to take place, and this continues until garbage collection is unable to free up enough space to allow things to move faster than a crawl.
>
> As a test we caused a commit to occur after each document is indexed and we see the total amount of memory reduced from nearly 100% of the Java heap to around 70-75%.
> The profiling tools now show that the memory is cleaned up to some extent after each document. But of course this completely defeats the whole reason why we want to only commit at the end of the run for performance's sake. Most of the data, as seen using heap analysis, is held in Byte, Character, and Integer classes whose GC roots are tied back to the Writer objects and threads. The instance counts, after running just 1,100 documents, seem staggering.
>
> Is there additional data that the IndexWriter hangs onto regardless of when it hits the RAMBufferSize limit? Why are we seeing the heap space all being used up?
>
> A side question to this is the fact that we always see a large amount of memory used by the IndexWriter even after our indexing has been completed and all commits have taken place (basically in an idle state). Why would this be? Is the only way to totally clean up the memory to close the writer? Our index is also used for real-time indexing, so the IndexWriter is intended to remain open for the lifetime of the app.
>
> Any help in understanding why the IndexWriter is maxing out our heap space or what is expected from memory usage of the IndexWriter would be appreciated.
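
On the question in the reply about closing the previous reader after opening a new one: the swap pattern might look roughly like the sketch below. It is plain Java with no Lucene dependency. `ReaderSwap`, `FakeReader`, and the `OPEN` counter are hypothetical names invented for illustration; `FakeReader` only stands in for whatever reader the application holds, and the sketch assumes no other thread is still searching the old reader when it is closed.

```java
import java.io.Closeable;
import java.util.concurrent.atomic.AtomicInteger;

public class ReaderSwap {
    // Counts readers that are currently open (illustration only).
    static final AtomicInteger OPEN = new AtomicInteger();

    /** Hypothetical stand-in for a reader; NOT Lucene's IndexReader. */
    static final class FakeReader implements Closeable {
        private boolean closed;

        FakeReader() {
            OPEN.incrementAndGet();
        }

        @Override
        public void close() {
            // Idempotent close so a double close does not skew the count.
            if (!closed) {
                closed = true;
                OPEN.decrementAndGet();
            }
        }
    }

    private FakeReader current;

    /** Opens a fresh reader, then closes the one it replaces. */
    synchronized FakeReader refresh() {
        FakeReader previous = current;
        current = new FakeReader();
        if (previous != null) {
            // Skipping this close is the leak the reply asks about:
            // every refresh would leave one more reader reachable.
            previous.close();
        }
        return current;
    }

    /** Closes the last reader when the application shuts down. */
    synchronized void shutdown() {
        if (current != null) {
            current.close();
            current = null;
        }
    }

    public static void main(String[] args) {
        ReaderSwap holder = new ReaderSwap();
        for (int i = 0; i < 1000; i++) {
            holder.refresh();
        }
        System.out.println("open after 1000 refreshes: " + OPEN.get()); // prints 1
        holder.shutdown();
        System.out.println("open after shutdown: " + OPEN.get()); // prints 0
    }
}
```

If the close in `refresh()` were removed, the counter would grow by one per refresh, which is the kind of steady growth that would show up in a heap dump as many retained reader instances.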