Subject: Re: IndexWriter memory leak?
From: Michael McCandless
To: java-dev@lucene.apache.org, Ruben Laguna
Reply-To: java-dev@lucene.apache.org
Date: Fri, 9 Apr 2010 06:32:21 -0400

I agree IW should not hold refs to the Field instances from the last
doc indexed... I put a patch on LUCENE-2387 to null the reference as
we go.  Can you confirm this lets GC reclaim?

Mike

On Fri, Apr 9, 2010 at 12:54 AM, Ruben Laguna wrote:
> But the Readers I'm talking about are not held by the Tokenizer (at least
> not *only* by it); these are held by the DocFieldProcessorPerThread...
>
> IndexWriter -> DocumentsWriter -> DocumentsWriterThreadState ->
> DocFieldProcessorPerThread -> DocFieldProcessorPerField -> Fieldable ->
> Field (fieldsData)
>
> and it's not only one Reader, there are several (one per thread, I
> suppose; in my heap dump there are 25 Readers that should have been
> GCed otherwise).
>
> Best regards / Ruben
>
> On Thu, Apr 8, 2010 at 11:49 PM, Uwe Schindler wrote:
>
>> There is one possibility that could be fixed:
>>
>> As Tokenizers are reused, the analyzer holds a reference to the last used
>> Reader. The easy fix would be to unset the Reader in Tokenizer.close(). If
>> this is the case for you, that may be easy to do. So Tokenizer.close() looks
>> like this:
>>
>>   /** By default, closes the input Reader. */
>>   @Override
>>   public void close() throws IOException {
>>     input.close();
>>     input = null; // <-- new!
>>   }
>>
>> Any comments from other committers?
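
For illustration, the kind of nulling Mike mentions at the top of the
thread might look roughly like this; the method and parameter names are
assumptions for the sketch, not the actual LUCENE-2387 patch:

    import org.apache.lucene.document.Fieldable;

    // Sketch only: after a document's fields have been consumed by the
    // indexing chain, drop the hard references so the Field instances
    // (and any Reader stored in fieldsData) become eligible for GC.
    static void clearConsumedFields(Fieldable[] fields, int fieldCount) {
      for (int i = 0; i < fieldCount; i++) {
        fields[i] = null;
      }
    }

Uwe's Tokenizer.close() change above addresses the analyzer side of the
same problem: the reused Tokenizer otherwise keeps a reference to the
last Reader it was given.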
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: uwe@thetaphi.de
>>
>> > -----Original Message-----
>> > From: Ruben Laguna [mailto:ruben.laguna@gmail.com]
>> > Sent: Thursday, April 08, 2010 2:50 PM
>> > To: java-user@lucene.apache.org
>> > Subject: Re: IndexWriter memory leak?
>> >
>> > I will double check the heapdump.hprof in the afternoon. But I think
>> > that *some* Readers are indeed held by
>> > docWriter.threadStates[0].consumer.fieldHash[1].fields[xxxx],
>> > as shown in [1] (this heap dump contains only live objects). The
>> > heap dump was taken after IndexWriter.commit() / IndexWriter.optimize(),
>> > and all the Documents were already indexed and GCed (I will double check).
>> >
>> > So that would mean that the Reader is retained in memory by the
>> > following chain of references:
>> >
>> > DocumentsWriter -> DocumentsWriterThreadState -> DocFieldProcessorPerThread
>> > -> DocFieldProcessorPerField -> Fieldable -> Field (fieldsData)
>> >
>> > I'll double check with Eclipse MAT, as I said, that this chain is
>> > actually made of hard references only (no SoftReferences, WeakReferences,
>> > etc.). I will also double check that there is no "live" Document that is
>> > referencing the Reader via the Field.
>> >
>> > [1] http://img.skitch.com/20100407-b86irkp7e4uif2wq1dd4t899qb.jpg
>> >
>> > On Thu, Apr 8, 2010 at 2:16 PM, Uwe Schindler wrote:
>> >
>> > > Readers are not held. If you indexed the document and GCed the
>> > > document instance, the Readers are gone.
>> > >
>> > > -----
>> > > Uwe Schindler
>> > > H.-H.-Meier-Allee 63, D-28213 Bremen
>> > > http://www.thetaphi.de
>> > > eMail: uwe@thetaphi.de
>> > >
>> > > > -----Original Message-----
>> > > > From: Ruben Laguna [mailto:ruben.laguna@gmail.com]
>> > > > Sent: Thursday, April 08, 2010 1:28 PM
>> > > > To: java-user@lucene.apache.org
>> > > > Subject: Re: IndexWriter memory leak?
>> > > >
>> > > > Now that the zzBuffer issue is solved...
>> > > >
>> > > > What about the references to the Readers held by docWriter? Tika's
>> > > > ParsingReaders are quite heavyweight, so retaining those in memory
>> > > > unnecessarily is also a "hidden" memory leak. Should I open a bug
>> > > > report on that one?
>> > > >
>> > > > /Rubén
>> > > >
>> > > > On Thu, Apr 8, 2010 at 12:11 PM, Shai Erera wrote:
>> > > >
>> > > > > Guess we were replying at the same time :).
>> > > > >
>> > > > > On Thu, Apr 8, 2010 at 1:04 PM, Uwe Schindler wrote:
>> > > > >
>> > > > > > I already answered that I will take care of this!
>> > > > > >
>> > > > > > Uwe
>> > > > > >
>> > > > > > -----
>> > > > > > Uwe Schindler
>> > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
>> > > > > > http://www.thetaphi.de
>> > > > > > eMail: uwe@thetaphi.de
>> > > > > >
>> > > > > > > -----Original Message-----
>> > > > > > > From: Shai Erera [mailto:serera@gmail.com]
>> > > > > > > Sent: Thursday, April 08, 2010 12:00 PM
>> > > > > > > To: java-user@lucene.apache.org
>> > > > > > > Subject: Re: IndexWriter memory leak?
>> > > > > > >
>> > > > > > > Yes, that's the trimBuffer version I was thinking about, only
>> > > > > > > this guy created a reset(Reader, int) and does both ops
>> > > > > > > (resetting + trim) in one method call. More convenient. Can
>> > > > > > > you please open an issue to track that?
>> > > > > > > People will have a chance to comment on whether we (Lucene)
>> > > > > > > should handle that, or whether it should be a JFlex fix. Based
>> > > > > > > on the number of replies this guy received (0 !), I doubt JFlex
>> > > > > > > would consider it a problem. But we can do some small service
>> > > > > > > to our user base by protecting against such problems.
>> > > > > > >
>> > > > > > > And while you're opening the issue, if you want to take a stab
>> > > > > > > at fixing it and post a patch, it'd be great :).
>> > > > > > >
>> > > > > > > Shai
>> > > > > > >
>> > > > > > > On Thu, Apr 8, 2010 at 12:51 PM, Ruben Laguna wrote:
>> > > > > > >
>> > > > > > > > I was investigating this a little further, and in the JFlex
>> > > > > > > > mailing list I found [1].
>> > > > > > > >
>> > > > > > > > I don't know much about flex / JFlex, but it seems that this
>> > > > > > > > guy resets the zzBuffer to 16384 or less when setting the
>> > > > > > > > input for the lexer.
>> > > > > > > >
>> > > > > > > > Quoted from shef:
>> > > > > > > >
>> > > > > > > > I set
>> > > > > > > >
>> > > > > > > >     %buffer 0
>> > > > > > > >
>> > > > > > > > in the options section, and then added this method to the lexer:
>> > > > > > > >
>> > > > > > > >     /**
>> > > > > > > >      * Set the input for the lexer. The size parameter really
>> > > > > > > >      * speeds things up, because by default the lexer allocates
>> > > > > > > >      * an internal buffer of 16k. For most strings, this is
>> > > > > > > >      * unnecessarily large. If the size param is 0 or greater
>> > > > > > > >      * than 16k, then the buffer is set to 16k. If the size
>> > > > > > > >      * param is smaller, then the buffer is set to the exact size.
>> > > > > > > >      * @param r the reader that provides the data
>> > > > > > > >      * @param size the size of the data in the reader
>> > > > > > > >      */
>> > > > > > > >     public void reset(Reader r, int size) {
>> > > > > > > >         if (size == 0 || size > 16384)
>> > > > > > > >             size = 16384;
>> > > > > > > >         zzBuffer = new char[size];
>> > > > > > > >         yyreset(r);
>> > > > > > > >     }
>> > > > > > > >
>> > > > > > > > So maybe there is a way to trim the zzBuffer this way (?).
>> > > > > > > >
>> > > > > > > > BTW, I will try to find out which is the "big token" in my
>> > > > > > > > dataset this afternoon. Thanks for the help.
>> > > > > > > >
>> > > > > > > > I actually worked around this memory problem for the time
>> > > > > > > > being by wrapping the IndexWriter in a class that periodically
>> > > > > > > > closes the IndexWriter and creates a new one, allowing the old
>> > > > > > > > one to be GCed, but it would be really good if either JFlex or
>> > > > > > > > Lucene could take care of this zzBuffer going berserk.
>> > > > > > > >
>> > > > > > > > Again, thanks for the quick response.
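
As a usage sketch, a caller that knows the input size could use the
reset(Reader, int) above roughly like this (the scanner variable and the
input string are placeholders; the generated lexer is assumed to have been
built with the method from the quoted post):

    // Sketch: size the lexer's internal buffer to the input instead of the
    // default 16k. "scanner" is the generated JFlex lexer with the proposed
    // reset(Reader, int).
    String text = "a short field value";              // placeholder input
    scanner.reset(new java.io.StringReader(text), text.length());
    // ... tokenize as usual; zzBuffer is now text.length() chars (capped at 16384)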
>> > > > > > > > /Rubén
>> > > > > > > >
>> > > > > > > > [1]
>> > > > > > > > https://sourceforge.net/mailarchive/message.php?msg_id=444070.38422.qm@web38901.mail.mud.yahoo.com
>> > > > > > > >
>> > > > > > > > On Thu, Apr 8, 2010 at 11:32 AM, Shai Erera wrote:
>> > > > > > > >
>> > > > > > > > > If we could change the flex file so that yyreset(Reader)
>> > > > > > > > > would check the size of zzBuffer, we could trim it when it
>> > > > > > > > > gets too big. But I don't think we have such control when
>> > > > > > > > > writing the flex syntax ... yyreset is generated by JFlex,
>> > > > > > > > > and that's the only place I can think of to trim the buffer
>> > > > > > > > > down when it exceeds a predefined threshold.
>> > > > > > > > >
>> > > > > > > > > Maybe what we can do is create our own method which will be
>> > > > > > > > > called by StandardTokenizer after yyreset is called, something
>> > > > > > > > > like trimBufferIfTooBig(int threshold), which will reallocate
>> > > > > > > > > zzBuffer if it exceeded the threshold. We can decide on a
>> > > > > > > > > reasonable 64K threshold or something, or simply always cut
>> > > > > > > > > back to 16 KB. As far as I understand, that buffer should
>> > > > > > > > > never grow that much. I.e. in zzRefill, which is the only
>> > > > > > > > > place where the buffer gets resized, there is an attempt to
>> > > > > > > > > first move back characters that were already consumed, and
>> > > > > > > > > only then allocate a bigger buffer. Which means the buffer
>> > > > > > > > > only gets expanded if there is a token whose size is larger
>> > > > > > > > > than 16KB (!?).
>> > > > > > > > >
>> > > > > > > > > A trimBuffer method might not be that bad ... as a protective
>> > > > > > > > > measure. What do you think? Of course, JFlex can fix it on
>> > > > > > > > > their own ... but until that happens ...
>> > > > > > > > >
>> > > > > > > > > Shai
>> > > > > > > > >
>> > > > > > > > > On Thu, Apr 8, 2010 at 10:35 AM, Uwe Schindler wrote:
>> > > > > > > > >
>> > > > > > > > > > > I would also like to identify the problematic document.
>> > > > > > > > > > > I have 10000, so what would be the best way of identifying
>> > > > > > > > > > > the one that is making zzBuffer grow without control?
>> > > > > > > > > >
>> > > > > > > > > > Don't index your documents, but instead pass them directly
>> > > > > > > > > > to the analyzer and consume the tokenstream manually. Then
>> > > > > > > > > > visit TermAttribute.termLength() for each Token.
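
Concretely, consuming the token stream by hand to spot the oversized token
might look roughly like this (a sketch against the Lucene 3.0-era attribute
API; the analyzer, field name, and input text are placeholders):

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.util.Version;

    // Sketch: run one document's text through the analyzer (here a
    // placeholder StandardAnalyzer) and record the longest token seen,
    // instead of indexing the document.
    static int maxTokenLength(String text) throws IOException {
      Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
      TokenStream ts = analyzer.tokenStream("body", new StringReader(text));
      TermAttribute termAtt = ts.addAttribute(TermAttribute.class);
      int maxLen = 0;
      while (ts.incrementToken()) {
        maxLen = Math.max(maxLen, termAtt.termLength());
      }
      ts.close();
      return maxLen;
    }

A document whose maximum token length comes back far above 16K is the one
forcing zzBuffer to grow, per the zzRefill behavior Shai describes above.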
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org