Return-Path: Delivered-To: apmail-lucene-java-user-archive@www.apache.org Received: (qmail 91221 invoked from network); 8 Apr 2010 10:04:36 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 8 Apr 2010 10:04:36 -0000 Received: (qmail 46950 invoked by uid 500); 8 Apr 2010 10:04:34 -0000 Delivered-To: apmail-lucene-java-user-archive@lucene.apache.org Received: (qmail 46838 invoked by uid 500); 8 Apr 2010 10:04:33 -0000 Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-user@lucene.apache.org Delivered-To: mailing list java-user@lucene.apache.org Received: (qmail 46830 invoked by uid 99); 8 Apr 2010 10:04:33 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 08 Apr 2010 10:04:33 +0000 X-ASF-Spam-Status: No, hits=-0.8 required=10.0 tests=AWL,SPF_NEUTRAL X-Spam-Check-By: apache.org Received-SPF: neutral (athena.apache.org: local policy) Received: from [85.25.71.29] (HELO mail.troja.net) (85.25.71.29) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 08 Apr 2010 10:04:27 +0000 Received: from localhost (localhost.localdomain [127.0.0.1]) by mail.troja.net (Postfix) with ESMTP id 8B8A8D36006 for ; Thu, 8 Apr 2010 12:04:05 +0200 (CEST) X-Virus-Scanned: Debian amavisd-new at mail.troja.net Received: from mail.troja.net ([127.0.0.1]) by localhost (megaira.troja.net [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id uis5NT4m1dpb for ; Thu, 8 Apr 2010 12:04:00 +0200 (CEST) Received: from VEGA (WDC-MARE.marum.de [134.102.249.74]) (using TLSv1 with cipher AES128-SHA (128/128 bits)) (No client certificate requested) by mail.troja.net (Postfix) with ESMTPSA id 21E0FD36005 for ; Thu, 8 Apr 2010 12:04:00 +0200 (CEST) From: "Uwe Schindler" To: References: <007e01cad6ee$10c8dc40$325a94c0$@de> In-Reply-To: Subject: RE: IndexWriter memory leak? Date: Thu, 8 Apr 2010 12:04:02 +0200 Message-ID: <00c301cad702$d10a4ca0$731ee5e0$@de> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable X-Mailer: Microsoft Office Outlook 12.0 Thread-Index: AcrXAlF0Bum3sfJ0Q/ymD5gZdZj4MwAAHOaA Content-Language: de I already answered, that I will take care of this! Uwe ----- Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: uwe@thetaphi.de > -----Original Message----- > From: Shai Erera [mailto:serera@gmail.com] > Sent: Thursday, April 08, 2010 12:00 PM > To: java-user@lucene.apache.org > Subject: Re: IndexWriter memory leak? >=20 > Yes, that's the trimBuffer version I was thinking about, only this guy > created a reset(Reader, int) and does both ops (resetting + trim) in > one > method call. More convenient. Can you please open an issue to track > that? > People will have a chance to comment on whether we (Lucene) should > handle > that, or it should be a JFlex fix. Based on the number of replies this > guy > received (0 !), I doubt JFlex would consider it a problem. But we can > do > some small service to our users base by protecting against such > problems. >=20 > And while you're opening the issue, if you want to take a stab at > fixing it > and post a patch, it'd be great :). >=20 > Shai >=20 > On Thu, Apr 8, 2010 at 12:51 PM, Ruben Laguna > wrote: >=20 > > I was investigating this a little further and in the JFlex mailing > list I > > found [1] > > > > I don't know much about flex / JFlex but it seems that this guy > resets the > > zzBuffer to 16384 or less when setting the input for the lexer > > > > > > Quoted from shef > > > > > > I set > > > > %buffer 0 > > > > in the options section, and then added this method to the lexer: > > > > /** > > * Set the input for the lexer. The size parameter really speeds > things > > up, > > * because by default, the lexer allocates an internal buffer of > 16k. > > For > > * most strings, this is unnecessarily large. If the size param = is > > 0 or greater > > * than 16k, then the buffer is set to 16k. If the size param is > > smaller, then > > * the buf will be set to the exact size. > > * @param r the reader that provides the data > > * @param the size of the data in the reader. > > */ > > public void reset(Reader r, int size) { > > if (size =3D=3D 0 || size > 16384) > > size =3D 16384; > > zzBuffer =3D new char[size]; > > yyreset(r); > > } > > > > > > So maybe there is a way to trim the zzBuffer this way (?). > > > > BTW, I will try to find out which is the "big token" in my dataset > this > > afternoon. Thanks for the help. > > > > I actually workaround this memory problem for the time being by > wrapping > > the > > IndexWriter in a class that periodically closes the IndexWriter and > creates > > a new one, allowing the old to be GCed, but I would be really good = if > > either > > JFlex or Lucene can take care of this zzBuffer going berserk. > > > > > > Again thanks for the quick response. /Rub=C3=A9n > > > > > > [1] > > > > > = https://sourceforge.net/mailarchive/message.php?msg_id=3D444070.38422.qm@= > web38901.mail.mud.yahoo.com > > > > On Thu, Apr 8, 2010 at 11:32 AM, Shai Erera = wrote: > > > > > If we could change the Flex file so that yyreset(Reader) would > check the > > > size of zzBuffer, we could trim it when it gets too big. But I > don't > > think > > > we have such control when writing the flex syntax ... yyreset is > > generated > > > by JFlex and that's the only place I can think of to trim the > buffer down > > > when it exceeds a predefined threshold .... > > > > > > Maybe what we can do is create our own method which will be called > by > > > StandardTokenizer after yyreset is called, something like > > > trimBufferIfTooBig(int threshold) which will reallocate zzBuffer = if > it > > > exceeded the threshold. We can decide on a reasonable 64K = threshold > or > > > something, or simply always cut back to 16 KB. As far as I > understand, > > that > > > buffer should never grow that much. I.e. in zzRefill, which is the > only > > > place where the buffer gets resized, there is an attempt to first > move > > back > > > characters that were already consumed and only then allocate a > bigger > > > buffer. Which means only if there is a token whose size is larger > than > > 16KB > > > (!?), will this buffer get expanded. > > > > > > A trimBuffer method might not be that bad .. as a protective > measure. > > What > > > do you think? Of course, JFlex can fix it on their own ... but > until that > > > happens ... > > > > > > Shai > > > > > > On Thu, Apr 8, 2010 at 10:35 AM, Uwe Schindler > wrote: > > > > > > > > I would like to identify also the problematic document I have > 10000 > > so, > > > > > what > > > > > would be the best way of identifying the one that it making > zzBuffer > > to > > > > > grow > > > > > without control? > > > > > > > > Dont index your documents, but instead pass them directly to the > > analyzer > > > > and consume the tokenstream manually. Then visit > > > TermAttribute.termLength() > > > > for each Token. > > > > > > > > > > > > = ----------------------------------------------------------------- > ---- > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org > > > > For additional commands, e-mail: = java-user-help@lucene.apache.org > > > > > > > > > > > > > > > > > > > -- > > /Rub=C3=A9n > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For additional commands, e-mail: java-user-help@lucene.apache.org