lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: IndexWriter memory leak?
Date Sat, 10 Apr 2010 09:32:12 GMT
Thank you for isolating & raising the issue!!

Mike

On Sat, Apr 10, 2010 at 1:28 AM, Ruben Laguna <ruben.laguna@gmail.com> wrote:
> I just tried the changes that you commited it works beatifully. The readers
> are GCed. Thanks for both LUCENE-2387 and LUCENE-2384. Those make a big
> difference in my app!
>
> On Fri, Apr 9, 2010 at 12:32 PM, Michael McCandless
> <lucene@mikemccandless.com> wrote:
>>
>> I agree IW should not hold refs to the Field instances from the last
>> doc indexed... I put a patch on LUCENE-2387 to null the reference as
>> we go.  Can you confirm this lets GC reclaim?
>>
>> Mike
>>
>> On Fri, Apr 9, 2010 at 12:54 AM, Ruben Laguna <ruben.laguna@gmail.com>
>> wrote:
>> > But the Readers I'm talking about are not held by the Tokenizer (at
>> > least
>> > not *only* by it), these are held by the DocFieldProccessorPerThread....
>> >
>> > IndexWriter -> DocumentsWriter -> DocumentsWriterThreadState ->
>> > DocFieldProcessorPerThread -> DocFieldProcessorPerField -> Fieldable ->
>> > Field (fieldsData)
>> >
>> > and it's not only one Reader there are several one (one per thread I
>> > suppose, in my heapdump there is 25 Reader that should have been GCed
>> > otherwise).
>> >
>> > Best regards/Ruben
>> > On Thu, Apr 8, 2010 at 11:49 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
>> >
>> >> There is one possibility, that could be fixed:
>> >>
>> >> As Tokenizers are reused, the analyzer holds a reference to the last
>> >> used
>> >> Reader. The easy fix would be to unset the Reader in Tokenizer.close().
>> >> If
>> >> this is the case for you, that may be easy to do. So Tokenizer.close()
>> >> looks
>> >> like this:
>> >>
>> >>  /** By default, closes the input Reader. */
>> >>  @Override
>> >>  public void close() throws IOException {
>> >>    input.close();
>> >>    input = null; // <-- new!
>> >>  }
>> >>
>> >> Any comments from other committers?
>> >>
>> >> -----
>> >> Uwe Schindler
>> >> H.-H.-Meier-Allee 63, D-28213 Bremen
>> >> http://www.thetaphi.de
>> >> eMail: uwe@thetaphi.de
>> >>
>> >>
>> >> > -----Original Message-----
>> >> > From: Ruben Laguna [mailto:ruben.laguna@gmail.com]
>> >> > Sent: Thursday, April 08, 2010 2:50 PM
>> >> > To: java-user@lucene.apache.org
>> >> > Subject: Re: IndexWriter memory leak?
>> >> >
>> >> > I will double check in the afternoon the heapdump.hprof. But I think
>> >> > that
>> >> > *some* readers are indeed held by
>> >> > docWriter.threadStates[0].consumer.fieldHash[1].fields[xxxx],
>> >> > as shown in [1] (this heapdump contains only live objects).  The
>> >> > heapdump
>> >> > was taken after IndexWriter.commit() /IndexWriter.optimize() and all
>> >> > the
>> >> > Documents were already indexed and GCed (I will double check).
>> >> >
>> >> > So that would mean that the Reader is retained in memory by the
>> >> > following
>> >> > chaing of references,
>> >> >
>> >> > DocumentsWriter -> DocumentsWriterThreadState ->
>> >> > DocFieldProcessorPerThread
>> >> > -> DocFieldProcessorPerField -> Fieldable -> Field (fieldsData)
>> >> >
>> >> > I'll double check with Eclipse MAT as I said that this chain is
>> >> > actually
>> >> > made of hard references only (no SoftReferences,WeakReferences, etc).
>> >> > I
>> >> > will
>> >> > also double check also that there is no "live" Document that is
>> >> > referencing
>> >> > the Reader via the Field.
>> >> >
>> >> >
>> >> > [1] http://img.skitch.com/20100407-b86irkp7e4uif2wq1dd4t899qb.jpg
>> >> >
>> >> > On Thu, Apr 8, 2010 at 2:16 PM, Uwe Schindler <uwe@thetaphi.de>
>> >> > wrote:
>> >> >
>> >> > > Readers are not held. If you indexed the document and gced the
>> >> > document
>> >> > > instance they readers are gone.
>> >> > >
>> >> > > -----
>> >> > > Uwe Schindler
>> >> > > H.-H.-Meier-Allee 63, D-28213 Bremen
>> >> > > http://www.thetaphi.de
>> >> > > eMail: uwe@thetaphi.de
>> >> > >
>> >> > >
>> >> > > > -----Original Message-----
>> >> > > > From: Ruben Laguna [mailto:ruben.laguna@gmail.com]
>> >> > > > Sent: Thursday, April 08, 2010 1:28 PM
>> >> > > > To: java-user@lucene.apache.org
>> >> > > > Subject: Re: IndexWriter memory leak?
>> >> > > >
>> >> > > > Now that the zzBuffer issue is solved...
>> >> > > >
>> >> > > > what about the references to the Readers held by docWriter.
>> >> > > > Tika´s
>> >> > > > ParsingReaders are quite heavyweight so retaining those in
memory
>> >> > > > unnecesarily is also a "hidden" memory leak. Should I open
a bug
>> >> > report
>> >> > > > on
>> >> > > > that one?
>> >> > > >
>> >> > > > /Rubén
>> >> > > >
>> >> > > > On Thu, Apr 8, 2010 at 12:11 PM, Shai Erera <serera@gmail.com>
>> >> > wrote:
>> >> > > >
>> >> > > > > Guess we were replying at the same time :).
>> >> > > > >
>> >> > > > > On Thu, Apr 8, 2010 at 1:04 PM, Uwe Schindler <uwe@thetaphi.de>
>> >> > > > wrote:
>> >> > > > >
>> >> > > > > > I already answered, that I will take care of this!
>> >> > > > > >
>> >> > > > > > Uwe
>> >> > > > > >
>> >> > > > > > -----
>> >> > > > > > Uwe Schindler
>> >> > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
>> >> > > > > > http://www.thetaphi.de
>> >> > > > > > eMail: uwe@thetaphi.de
>> >> > > > > >
>> >> > > > > >
>> >> > > > > > > -----Original Message-----
>> >> > > > > > > From: Shai Erera [mailto:serera@gmail.com]
>> >> > > > > > > Sent: Thursday, April 08, 2010 12:00 PM
>> >> > > > > > > To: java-user@lucene.apache.org
>> >> > > > > > > Subject: Re: IndexWriter memory leak?
>> >> > > > > > >
>> >> > > > > > > Yes, that's the trimBuffer version I was thinking
about,
>> >> > > > > > > only
>> >> > > > this guy
>> >> > > > > > > created a reset(Reader, int) and does both
ops (resetting +
>> >> > trim)
>> >> > > > in
>> >> > > > > > > one
>> >> > > > > > > method call. More convenient. Can you please
open an issue
>> >> > > > > > > to
>> >> > > > track
>> >> > > > > > > that?
>> >> > > > > > > People will have a chance to comment on whether
we (Lucene)
>> >> > > > should
>> >> > > > > > > handle
>> >> > > > > > > that, or it should be a JFlex fix. Based on
the number of
>> >> > replies
>> >> > > > this
>> >> > > > > > > guy
>> >> > > > > > > received (0 !), I doubt JFlex would consider
it a problem.
>> >> > But we
>> >> > > > can
>> >> > > > > > > do
>> >> > > > > > > some small service to our users base by protecting
against
>> >> > such
>> >> > > > > > > problems.
>> >> > > > > > >
>> >> > > > > > > And while you're opening the issue, if you
want to take a
>> >> > stab at
>> >> > > > > > > fixing it
>> >> > > > > > > and post a patch, it'd be great :).
>> >> > > > > > >
>> >> > > > > > > Shai
>> >> > > > > > >
>> >> > > > > > > On Thu, Apr 8, 2010 at 12:51 PM, Ruben Laguna
>> >> > > > > > > <ruben.laguna@gmail.com>wrote:
>> >> > > > > > >
>> >> > > > > > > > I was investigating this a little further
and in the
>> >> > > > > > > > JFlex
>> >> > > > mailing
>> >> > > > > > > list I
>> >> > > > > > > > found [1]
>> >> > > > > > > >
>> >> > > > > > > > I don't know much about flex / JFlex
but it seems that
>> >> > > > > > > > this
>> >> > guy
>> >> > > > > > > resets the
>> >> > > > > > > > zzBuffer to 16384 or less when setting
the input for the
>> >> > lexer
>> >> > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > Quoted from  shef <shef31@ya...>
>> >> > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > I set
>> >> > > > > > > >
>> >> > > > > > > > %buffer 0
>> >> > > > > > > >
>> >> > > > > > > > in the options section, and then added
this method to the
>> >> > > > lexer:
>> >> > > > > > > >
>> >> > > > > > > >    /**
>> >> > > > > > > >     * Set the input for the lexer.
The size parameter
>> >> > really
>> >> > > > speeds
>> >> > > > > > > things
>> >> > > > > > > > up,
>> >> > > > > > > >     * because by default, the lexer
allocates an internal
>> >> > > > buffer of
>> >> > > > > > > 16k.
>> >> > > > > > > > For
>> >> > > > > > > >     * most strings, this is unnecessarily
large. If the
>> >> > size
>> >> > > > param is
>> >> > > > > > > > 0 or greater
>> >> > > > > > > >     * than 16k, then the buffer is
set to 16k. If the
>> >> > > > > > > > size
>> >> > > > param is
>> >> > > > > > > > smaller, then
>> >> > > > > > > >     * the buf will be set to the exact
size.
>> >> > > > > > > >     * @param r the reader that provides
the data
>> >> > > > > > > >     * @param the size of the data in
the reader.
>> >> > > > > > > >     */
>> >> > > > > > > >    public void reset(Reader r, int
size) {
>> >> > > > > > > >        if (size == 0 || size >
16384)
>> >> > > > > > > >            size = 16384;
>> >> > > > > > > >        zzBuffer = new char[size];
>> >> > > > > > > >        yyreset(r);
>> >> > > > > > > >    }
>> >> > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > So maybe there is a way to trim the zzBuffer
this way
>> >> > > > > > > > (?).
>> >> > > > > > > >
>> >> > > > > > > > BTW, I will try to find out which is
the "big token" in
>> >> > > > > > > > my
>> >> > > > dataset
>> >> > > > > > > this
>> >> > > > > > > > afternoon. Thanks for the help.
>> >> > > > > > > >
>> >> > > > > > > > I actually workaround this memory problem
for the time
>> >> > being by
>> >> > > > > > > wrapping
>> >> > > > > > > > the
>> >> > > > > > > > IndexWriter in a class that periodically
closes the
>> >> > IndexWriter
>> >> > > > and
>> >> > > > > > > creates
>> >> > > > > > > > a new one, allowing the old to be GCed,
but I would be
>> >> > really
>> >> > > > good if
>> >> > > > > > > > either
>> >> > > > > > > > JFlex or Lucene can take care of this
zzBuffer going
>> >> > berserk.
>> >> > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > Again thanks for the quick response.
/Rubén
>> >> > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > [1]
>> >> > > > > > > >
>> >> > > > > > > >
>> >> > > > > > >
>> >> > > > >
>> >> > > >
>> >> >
>> >> > https://sourceforge.net/mailarchive/message.php?msg_id=444070.38422.qm@
>> >> > > > > > > web38901.mail.mud.yahoo.com
>> >> > > > > > > >
>> >> > > > > > > > On Thu, Apr 8, 2010 at 11:32 AM, Shai
Erera
>> >> > <serera@gmail.com>
>> >> > > > > wrote:
>> >> > > > > > > >
>> >> > > > > > > > > If we could change the Flex file
so that
>> >> > > > > > > > > yyreset(Reader)
>> >> > > > would
>> >> > > > > > > check the
>> >> > > > > > > > > size of zzBuffer, we could trim
it when it gets too
>> >> > > > > > > > > big.
>> >> > But
>> >> > > > I
>> >> > > > > > > don't
>> >> > > > > > > > think
>> >> > > > > > > > > we have such control when writing
the flex syntax ...
>> >> > yyreset
>> >> > > > is
>> >> > > > > > > > generated
>> >> > > > > > > > > by JFlex and that's the only place
I can think of to
>> >> > > > > > > > > trim
>> >> > the
>> >> > > > > > > buffer down
>> >> > > > > > > > > when it exceeds a predefined threshold
....
>> >> > > > > > > > >
>> >> > > > > > > > > Maybe what we can do is create our
own method which
>> >> > > > > > > > > will
>> >> > be
>> >> > > > called
>> >> > > > > > > by
>> >> > > > > > > > > StandardTokenizer after yyreset
is called, something
>> >> > > > > > > > > like
>> >> > > > > > > > > trimBufferIfTooBig(int threshold)
which will reallocate
>> >> > > > zzBuffer if
>> >> > > > > > > it
>> >> > > > > > > > > exceeded the threshold. We can decide
on a reasonable
>> >> > > > > > > > > 64K
>> >> > > > threshold
>> >> > > > > > > or
>> >> > > > > > > > > something, or simply always cut
back to 16 KB. As far
>> >> > > > > > > > > as
>> >> > I
>> >> > > > > > > understand,
>> >> > > > > > > > that
>> >> > > > > > > > > buffer should never grow that much.
I.e. in zzRefill,
>> >> > which
>> >> > > > is the
>> >> > > > > > > only
>> >> > > > > > > > > place where the buffer gets resized,
there is an
>> >> > > > > > > > > attempt
>> >> > to
>> >> > > > first
>> >> > > > > > > move
>> >> > > > > > > > back
>> >> > > > > > > > > characters that were already consumed
and only then
>> >> > allocate
>> >> > > > a
>> >> > > > > > > bigger
>> >> > > > > > > > > buffer. Which means only if there
is a token whose size
>> >> > is
>> >> > > > larger
>> >> > > > > > > than
>> >> > > > > > > > 16KB
>> >> > > > > > > > > (!?), will this buffer get expanded.
>> >> > > > > > > > >
>> >> > > > > > > > > A trimBuffer method might not be
that bad .. as a
>> >> > protective
>> >> > > > > > > measure.
>> >> > > > > > > > What
>> >> > > > > > > > > do you think? Of course, JFlex can
fix it on their own
>> >> > ...
>> >> > > > but
>> >> > > > > > > until that
>> >> > > > > > > > > happens ...
>> >> > > > > > > > >
>> >> > > > > > > > > Shai
>> >> > > > > > > > >
>> >> > > > > > > > > On Thu, Apr 8, 2010 at 10:35 AM,
Uwe Schindler
>> >> > > > <uwe@thetaphi.de>
>> >> > > > > > > wrote:
>> >> > > > > > > > >
>> >> > > > > > > > > > > I would like to identify
also the problematic
>> >> > document I
>> >> > > > have
>> >> > > > > > > 10000
>> >> > > > > > > > so,
>> >> > > > > > > > > > > what
>> >> > > > > > > > > > > would be the best way
of identifying the one that
>> >> > > > > > > > > > > it
>> >> > > > making
>> >> > > > > > > zzBuffer
>> >> > > > > > > > to
>> >> > > > > > > > > > > grow
>> >> > > > > > > > > > > without control?
>> >> > > > > > > > > >
>> >> > > > > > > > > > Dont index your documents,
but instead pass them
>> >> > directly
>> >> > > > to the
>> >> > > > > > > > analyzer
>> >> > > > > > > > > > and consume the tokenstream
manually. Then visit
>> >> > > > > > > > > TermAttribute.termLength()
>> >> > > > > > > > > > for each Token.
>> >> > > > > > > > > >
>> >> > > > > > > > > >
>> >> > > > > > > > > >
>> >> > > > > > > > > > -------------------------------------------------------
>> >> > ----
>> >> > > > ------
>> >> > > > > > > ----
>> >> > > > > > > > > > To unsubscribe, e-mail: java-user-
>> >> > > > unsubscribe@lucene.apache.org
>> >> > > > > > > > > > For additional commands, e-mail:
>> >> > > > > java-user-help@lucene.apache.org
>> >> > > > > > > > > >
>> >> > > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > --
>> >> > > > > > > > /Rubén
>> >> > > > > > > >
>> >> > > > > >
>> >> > > > > >
>> >> > > > > >
>> >> > > > > > ---------------------------------------------------------------
>> >> > ----
>> >> > > > --
>> >> > > > > > To unsubscribe, e-mail:
>> >> > > > > > java-user-unsubscribe@lucene.apache.org
>> >> > > > > > For additional commands, e-mail: java-user-
>> >> > help@lucene.apache.org
>> >> > > > > >
>> >> > > > > >
>> >> > > > >
>> >> > > >
>> >> > > >
>> >> > > >
>> >> > > > --
>> >> > > > /Rubén
>> >> > >
>> >> > >
>> >> > >
>> >> > > ---------------------------------------------------------------------
>> >> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> > > For additional commands, e-mail: java-user-help@lucene.apache.org
>> >> > >
>> >> > >
>> >> >
>> >> >
>> >> > --
>> >> > /Rubén
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-user-help@lucene.apache.org
>> >>
>> >>
>> >
>> >
>> > --
>> > /Rubén
>> >
>
>
>
> --
> /Rubén
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message