Subject: Re: IndexWriter memory leak?
From: Michael McCandless
To: java-dev@lucene.apache.org, Ruben Laguna
Reply-To: java-dev@lucene.apache.org
Date: Fri, 9 Apr 2010 06:32:21 -0400

I agree IW should not hold refs to the Field instances from the last
doc indexed... I put a patch on LUCENE-2387 to null the reference as
we go.  Can you confirm this lets GC reclaim?

Mike

On Fri, Apr 9, 2010 at 12:54 AM, Ruben Laguna wrote:
> But the Readers I'm talking about are not held by the Tokenizer (at least
> not *only* by it); these are held by the DocFieldProcessorPerThread...
>
> IndexWriter -> DocumentsWriter -> DocumentsWriterThreadState ->
> DocFieldProcessorPerThread -> DocFieldProcessorPerField -> Fieldable ->
> Field (fieldsData)
>
> and it's not only one Reader, there are several (one per thread, I
> suppose; in my heap dump there are 25 Readers that should have been
> GCed otherwise).
>
> Best regards / Ruben
>
> On Thu, Apr 8, 2010 at 11:49 PM, Uwe Schindler wrote:
>
>> There is one possibility that could be fixed:
>>
>> As Tokenizers are reused, the analyzer holds a reference to the last used
>> Reader. The easy fix would be to unset the Reader in Tokenizer.close(). If
>> this is the case for you, that may be easy to do. So Tokenizer.close() looks
>> like this:
>>
>>   /** By default, closes the input Reader. */
>>   @Override
>>   public void close() throws IOException {
>>     input.close();
>>     input = null; // <-- new!
>>   }
>>
>> Any comments from other committers?
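
For illustration, the kind of nulling Mike mentions at the top of the
thread might look roughly like this; the method and parameter names are
assumptions for the sketch, not the actual LUCENE-2387 patch:

    import org.apache.lucene.document.Fieldable;

    // Sketch only: after a document's fields have been consumed by the
    // indexing chain, drop the hard references so the Field instances
    // (and any Reader stored in fieldsData) become eligible for GC.
    static void clearConsumedFields(Fieldable[] fields, int fieldCount) {
      for (int i = 0; i < fieldCount; i++) {
        fields[i] = null;
      }
    }

Uwe's Tokenizer.close() change above addresses the analyzer side of the
same problem: the reused Tokenizer otherwise keeps a reference to the
last Reader it was given.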
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: uwe@thetaphi.de
>>
>> > -----Original Message-----
>> > From: Ruben Laguna [mailto:ruben.laguna@gmail.com]
>> > Sent: Thursday, April 08, 2010 2:50 PM
>> > To: java-user@lucene.apache.org
>> > Subject: Re: IndexWriter memory leak?
>> >
>> > I will double check the heapdump.hprof in the afternoon. But I think
>> > that *some* Readers are indeed held by
>> > docWriter.threadStates[0].consumer.fieldHash[1].fields[xxxx],
>> > as shown in [1] (this heap dump contains only live objects). The
>> > heap dump was taken after IndexWriter.commit() / IndexWriter.optimize(),
>> > and all the Documents were already indexed and GCed (I will double check).
>> >
>> > So that would mean that the Reader is retained in memory by the
>> > following chain of references:
>> >
>> > DocumentsWriter -> DocumentsWriterThreadState -> DocFieldProcessorPerThread
>> > -> DocFieldProcessorPerField -> Fieldable -> Field (fieldsData)
>> >
>> > I'll double check with Eclipse MAT, as I said, that this chain is
>> > actually made of hard references only (no SoftReferences, WeakReferences,
>> > etc.). I will also double check that there is no "live" Document that is
>> > referencing the Reader via the Field.
>> >
>> > [1] http://img.skitch.com/20100407-b86irkp7e4uif2wq1dd4t899qb.jpg
>> >
>> > On Thu, Apr 8, 2010 at 2:16 PM, Uwe Schindler wrote:
>> >
>> > > Readers are not held. If you indexed the document and GCed the
>> > > document instance, the Readers are gone.
>> > >
>> > > -----
>> > > Uwe Schindler
>> > > H.-H.-Meier-Allee 63, D-28213 Bremen
>> > > http://www.thetaphi.de
>> > > eMail: uwe@thetaphi.de
>> > >
>> > > > -----Original Message-----
>> > > > From: Ruben Laguna [mailto:ruben.laguna@gmail.com]
>> > > > Sent: Thursday, April 08, 2010 1:28 PM
>> > > > To: java-user@lucene.apache.org
>> > > > Subject: Re: IndexWriter memory leak?
>> > > >
>> > > > Now that the zzBuffer issue is solved...
>> > > >
>> > > > What about the references to the Readers held by docWriter? Tika's
>> > > > ParsingReaders are quite heavyweight, so retaining those in memory
>> > > > unnecessarily is also a "hidden" memory leak. Should I open a bug
>> > > > report on that one?
>> > > >
>> > > > /Rubén
>> > > >
>> > > > On Thu, Apr 8, 2010 at 12:11 PM, Shai Erera wrote:
>> > > >
>> > > > > Guess we were replying at the same time :).
>> > > > >
>> > > > > On Thu, Apr 8, 2010 at 1:04 PM, Uwe Schindler wrote:
>> > > > >
>> > > > > > I already answered that I will take care of this!
>> > > > > >
>> > > > > > Uwe
>> > > > > >
>> > > > > > -----
>> > > > > > Uwe Schindler
>> > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
>> > > > > > http://www.thetaphi.de
>> > > > > > eMail: uwe@thetaphi.de
>> > > > > >
>> > > > > > > -----Original Message-----
>> > > > > > > From: Shai Erera [mailto:serera@gmail.com]
>> > > > > > > Sent: Thursday, April 08, 2010 12:00 PM
>> > > > > > > To: java-user@lucene.apache.org
>> > > > > > > Subject: Re: IndexWriter memory leak?
>> > > > > > >
>> > > > > > > Yes, that's the trimBuffer version I was thinking about, only
>> > > > > > > this guy created a reset(Reader, int) and does both ops
>> > > > > > > (resetting + trim) in one method call. More convenient. Can
>> > > > > > > you please open an issue to track that?
>> > > > > > > People will have a chance to comment on whether we (Lucene)
>> > > > > > > should handle that, or whether it should be a JFlex fix. Based
>> > > > > > > on the number of replies this guy received (0 !), I doubt JFlex
>> > > > > > > would consider it a problem. But we can do some small service
>> > > > > > > to our user base by protecting against such problems.
>> > > > > > >
>> > > > > > > And while you're opening the issue, if you want to take a stab
>> > > > > > > at fixing it and post a patch, it'd be great :).
>> > > > > > >
>> > > > > > > Shai
>> > > > > > >
>> > > > > > > On Thu, Apr 8, 2010 at 12:51 PM, Ruben Laguna wrote:
>> > > > > > >
>> > > > > > > > I was investigating this a little further, and in the JFlex
>> > > > > > > > mailing list I found [1].
>> > > > > > > >
>> > > > > > > > I don't know much about flex / JFlex, but it seems that this
>> > > > > > > > guy resets the zzBuffer to 16384 or less when setting the
>> > > > > > > > input for the lexer.
>> > > > > > > >
>> > > > > > > > Quoted from shef:
>> > > > > > > >
>> > > > > > > > I set
>> > > > > > > >
>> > > > > > > >     %buffer 0
>> > > > > > > >
>> > > > > > > > in the options section, and then added this method to the lexer:
>> > > > > > > >
>> > > > > > > >     /**
>> > > > > > > >      * Set the input for the lexer. The size parameter really
>> > > > > > > >      * speeds things up, because by default the lexer allocates
>> > > > > > > >      * an internal buffer of 16k. For most strings, this is
>> > > > > > > >      * unnecessarily large. If the size param is 0 or greater
>> > > > > > > >      * than 16k, then the buffer is set to 16k. If the size
>> > > > > > > >      * param is smaller, then the buffer is set to the exact size.
>> > > > > > > >      * @param r the reader that provides the data
>> > > > > > > >      * @param size the size of the data in the reader
>> > > > > > > >      */
>> > > > > > > >     public void reset(Reader r, int size) {
>> > > > > > > >         if (size == 0 || size > 16384)
>> > > > > > > >             size = 16384;
>> > > > > > > >         zzBuffer = new char[size];
>> > > > > > > >         yyreset(r);
>> > > > > > > >     }
>> > > > > > > >
>> > > > > > > > So maybe there is a way to trim the zzBuffer this way (?).
>> > > > > > > >
>> > > > > > > > BTW, I will try to find out which is the "big token" in my
>> > > > > > > > dataset this afternoon. Thanks for the help.
>> > > > > > > >
>> > > > > > > > I actually worked around this memory problem for the time
>> > > > > > > > being by wrapping the IndexWriter in a class that periodically
>> > > > > > > > closes the IndexWriter and creates a new one, allowing the old
>> > > > > > > > one to be GCed, but it would be really good if either JFlex or
>> > > > > > > > Lucene could take care of this zzBuffer going berserk.
>> > > > > > > >
>> > > > > > > > Again, thanks for the quick response.
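
As a usage sketch, a caller that knows the input size could use the
reset(Reader, int) above roughly like this (the scanner variable and the
input string are placeholders; the generated lexer is assumed to have been
built with the method from the quoted post):

    // Sketch: size the lexer's internal buffer to the input instead of the
    // default 16k. "scanner" is the generated JFlex lexer with the proposed
    // reset(Reader, int).
    String text = "a short field value";              // placeholder input
    scanner.reset(new java.io.StringReader(text), text.length());
    // ... tokenize as usual; zzBuffer is now text.length() chars (capped at 16384)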
>> > > > > > > > /Rubén
>> > > > > > > >
>> > > > > > > > [1]
>> > > > > > > > https://sourceforge.net/mailarchive/message.php?msg_id=444070.38422.qm@web38901.mail.mud.yahoo.com
>> > > > > > > >
>> > > > > > > > On Thu, Apr 8, 2010 at 11:32 AM, Shai Erera wrote:
>> > > > > > > >
>> > > > > > > > > If we could change the flex file so that yyreset(Reader)
>> > > > > > > > > would check the size of zzBuffer, we could trim it when it
>> > > > > > > > > gets too big. But I don't think we have such control when
>> > > > > > > > > writing the flex syntax ... yyreset is generated by JFlex,
>> > > > > > > > > and that's the only place I can think of to trim the buffer
>> > > > > > > > > down when it exceeds a predefined threshold.
>> > > > > > > > >
>> > > > > > > > > Maybe what we can do is create our own method which will be
>> > > > > > > > > called by StandardTokenizer after yyreset is called, something
>> > > > > > > > > like trimBufferIfTooBig(int threshold), which will reallocate
>> > > > > > > > > zzBuffer if it exceeded the threshold. We can decide on a
>> > > > > > > > > reasonable 64K threshold or something, or simply always cut
>> > > > > > > > > back to 16 KB. As far as I understand, that buffer should
>> > > > > > > > > never grow that much. I.e. in zzRefill, which is the only
>> > > > > > > > > place where the buffer gets resized, there is an attempt to
>> > > > > > > > > first move back characters that were already consumed, and
>> > > > > > > > > only then allocate a bigger buffer. Which means the buffer
>> > > > > > > > > only gets expanded if there is a token whose size is larger
>> > > > > > > > > than 16KB (!?).
>> > > > > > > > >
>> > > > > > > > > A trimBuffer method might not be that bad ... as a protective
>> > > > > > > > > measure. What do you think? Of course, JFlex can fix it on
>> > > > > > > > > their own ... but until that happens ...
>> > > > > > > > >
>> > > > > > > > > Shai
>> > > > > > > > >
>> > > > > > > > > On Thu, Apr 8, 2010 at 10:35 AM, Uwe Schindler wrote:
>> > > > > > > > >
>> > > > > > > > > > > I would also like to identify the problematic document.
>> > > > > > > > > > > I have 10000, so what would be the best way of identifying
>> > > > > > > > > > > the one that is making zzBuffer grow without control?
>> > > > > > > > > >
>> > > > > > > > > > Don't index your documents, but instead pass them directly
>> > > > > > > > > > to the analyzer and consume the tokenstream manually. Then
>> > > > > > > > > > visit TermAttribute.termLength() for each Token.
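
Concretely, consuming the token stream by hand to spot the oversized token
might look roughly like this (a sketch against the Lucene 3.0-era attribute
API; the analyzer, field name, and input text are placeholders):

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.util.Version;

    // Sketch: run one document's text through the analyzer (here a
    // placeholder StandardAnalyzer) and record the longest token seen,
    // instead of indexing the document.
    static int maxTokenLength(String text) throws IOException {
      Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
      TokenStream ts = analyzer.tokenStream("body", new StringReader(text));
      TermAttribute termAtt = ts.addAttribute(TermAttribute.class);
      int maxLen = 0;
      while (ts.incrementToken()) {
        maxLen = Math.max(maxLen, termAtt.termLength());
      }
      ts.close();
      return maxLen;
    }

A document whose maximum token length comes back far above 16K is the one
forcing zzBuffer to grow, per the zzRefill behavior Shai describes above.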
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org