lucene-java-user mailing list archives

From Ruben Laguna <ruben.lag...@gmail.com>
Subject Re: IndexWriter memory leak?
Date Thu, 08 Apr 2010 12:50:27 GMT
I will double-check the heapdump.hprof this afternoon. But I think that
*some* readers are indeed held by
docWriter.threadStates[0].consumer.fieldHash[1].fields[xxxx],
as shown in [1] (this heap dump contains only live objects). The heap dump
was taken after IndexWriter.commit()/IndexWriter.optimize(), when all the
Documents had already been indexed and GCed (I will double-check).

So that would mean that the Reader is retained in memory by the following
chain of references:

DocumentsWriter -> DocumentsWriterThreadState -> DocFieldProcessorPerThread
-> DocFieldProcessorPerField -> Fieldable -> Field (fieldsData)

As I said, I'll double-check with Eclipse MAT that this chain is actually
made of hard references only (no SoftReferences, WeakReferences, etc.). I
will also double-check that there is no "live" Document still referencing
the Reader via the Field.


[1] http://img.skitch.com/20100407-b86irkp7e4uif2wq1dd4t899qb.jpg
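To make this concrete, here is a minimal sketch (Lucene 3.x API; I use a
plain StringReader below where my real code passes Tika's ParsingReader):

    import java.io.Reader;
    import java.io.StringReader;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // A Field constructed from a Reader keeps it in Field.fieldsData, so
    // anything that retains the Field also retains the Reader.
    Reader content = new StringReader("parsed document body");
    Document doc = new Document();
    doc.add(new Field("content", content)); // tokenized, not stored
    // writer.addDocument(doc);
    // If docWriter's per-field state still points at this Field after the
    // add, the Reader stays strongly reachable even after commit().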

On Thu, Apr 8, 2010 at 2:16 PM, Uwe Schindler <uwe@thetaphi.de> wrote:

> Readers are not held. If you indexed the document and GCed the document
> instance, the readers are gone.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
> > -----Original Message-----
> > From: Ruben Laguna [mailto:ruben.laguna@gmail.com]
> > Sent: Thursday, April 08, 2010 1:28 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: IndexWriter memory leak?
> >
> > Now that the zzBuffer issue is solved...
> >
> > what about the references to the Readers held by docWriter? Tika's
> > ParsingReaders are quite heavyweight, so retaining those in memory
> > unnecessarily is also a "hidden" memory leak. Should I open a bug report
> > on that one?
> >
> > /Rubén
> >
> > On Thu, Apr 8, 2010 at 12:11 PM, Shai Erera <serera@gmail.com> wrote:
> >
> > > Guess we were replying at the same time :).
> > >
> > > On Thu, Apr 8, 2010 at 1:04 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> > >
> > > > I already answered that I will take care of this!
> > > >
> > > > Uwe
> > > >
> > > > -----
> > > > Uwe Schindler
> > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > http://www.thetaphi.de
> > > > eMail: uwe@thetaphi.de
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Shai Erera [mailto:serera@gmail.com]
> > > > > Sent: Thursday, April 08, 2010 12:00 PM
> > > > > To: java-user@lucene.apache.org
> > > > > Subject: Re: IndexWriter memory leak?
> > > > >
> > > > > Yes, that's the trimBuffer version I was thinking about, only this
> > > > > guy created a reset(Reader, int) and does both ops (resetting +
> > > > > trimming) in one method call. More convenient. Can you please open
> > > > > an issue to track that? People will have a chance to comment on
> > > > > whether we (Lucene) should handle it, or whether it should be a
> > > > > JFlex fix. Based on the number of replies this guy received (0!), I
> > > > > doubt JFlex would consider it a problem. But we can do some small
> > > > > service to our user base by protecting against such problems.
> > > > >
> > > > > And while you're opening the issue, if you want to take a stab at
> > > > > fixing it and posting a patch, it'd be great :).
> > > > >
> > > > > Shai
> > > > >
> > > > > On Thu, Apr 8, 2010 at 12:51 PM, Ruben Laguna <ruben.laguna@gmail.com> wrote:
> > > > >
> > > > > > I was investigating this a little further, and in the JFlex
> > > > > > mailing list I found [1].
> > > > > >
> > > > > > I don't know much about flex / JFlex, but it seems that this guy
> > > > > > resets the zzBuffer to 16384 or less when setting the input for
> > > > > > the lexer.
> > > > > >
> > > > > >
> > > > > > Quoted from shef <shef31@ya...>
> > > > > >
> > > > > >
> > > > > > I set
> > > > > >
> > > > > > %buffer 0
> > > > > >
> > > > > > in the options section, and then added this method to the lexer:
> > > > > >
> > > > > >    /**
> > > > > >     * Set the input for the lexer. The size parameter really
> > > > > >     * speeds things up, because by default the lexer allocates an
> > > > > >     * internal buffer of 16k. For most strings, this is
> > > > > >     * unnecessarily large. If the size param is 0 or greater than
> > > > > >     * 16k, then the buffer is set to 16k. If the size param is
> > > > > >     * smaller, then the buffer will be set to the exact size.
> > > > > >     * @param r the reader that provides the data
> > > > > >     * @param size the size of the data in the reader.
> > > > > >     */
> > > > > >    public void reset(Reader r, int size) {
> > > > > >        if (size == 0 || size > 16384)
> > > > > >            size = 16384;
> > > > > >        zzBuffer = new char[size];
> > > > > >        yyreset(r);
> > > > > >    }
> > > > > >
> > > > > >
> > > > > > So maybe there is a way to trim the zzBuffer this way (?).
> > > > > >
> > > > > > BTW, I will try to find out which is the "big token" in my
> > > > > > dataset this afternoon. Thanks for the help.
> > > > > >
> > > > > > I actually worked around this memory problem for the time being
> > > > > > by wrapping the IndexWriter in a class that periodically closes
> > > > > > the IndexWriter and creates a new one, allowing the old one to be
> > > > > > GCed (sketch below), but it would be really good if either JFlex
> > > > > > or Lucene could take care of this zzBuffer going berserk.
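> > > > > >
> > > > > > The wrapper is roughly like this (simplified sketch, names made
> > > > > > up; the real class has more error handling):
> > > > > >
> > > > > >     import java.io.IOException;
> > > > > >     import org.apache.lucene.analysis.Analyzer;
> > > > > >     import org.apache.lucene.document.Document;
> > > > > >     import org.apache.lucene.index.IndexWriter;
> > > > > >     import org.apache.lucene.store.Directory;
> > > > > >
> > > > > >     class RecyclingIndexWriter {
> > > > > >         private final Directory dir;
> > > > > >         private final Analyzer analyzer;
> > > > > >         private IndexWriter writer;
> > > > > >         private int count = 0;
> > > > > >
> > > > > >         RecyclingIndexWriter(Directory dir, Analyzer analyzer)
> > > > > >                 throws IOException {
> > > > > >             this.dir = dir;
> > > > > >             this.analyzer = analyzer;
> > > > > >             this.writer = new IndexWriter(dir, analyzer,
> > > > > >                     IndexWriter.MaxFieldLength.UNLIMITED);
> > > > > >         }
> > > > > >
> > > > > >         void addDocument(Document doc) throws IOException {
> > > > > >             writer.addDocument(doc);
> > > > > >             if (++count % 1000 == 0) {
> > > > > >                 // close the old writer so its internal buffers
> > > > > >                 // (zzBuffer, docWriter state) become GCable
> > > > > >                 writer.close();
> > > > > >                 writer = new IndexWriter(dir, analyzer,
> > > > > >                         IndexWriter.MaxFieldLength.UNLIMITED);
> > > > > >             }
> > > > > >         }
> > > > > >     }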
> > > > > >
> > > > > >
> > > > > > Again thanks for the quick response. /Rubén
> > > > > >
> > > > > >
> > > > > > [1]
> > > > > > https://sourceforge.net/mailarchive/message.php?msg_id=444070.38422.qm@web38901.mail.mud.yahoo.com
> > > > > >
> > > > > > On Thu, Apr 8, 2010 at 11:32 AM, Shai Erera <serera@gmail.com> wrote:
> > > > > >
> > > > > > > If we could change the Flex file so that yyreset(Reader) would
> > > > > > > check the size of zzBuffer, we could trim it when it gets too
> > > > > > > big. But I don't think we have such control when writing the
> > > > > > > flex syntax ... yyreset is generated by JFlex, and that's the
> > > > > > > only place I can think of to trim the buffer down when it
> > > > > > > exceeds a predefined threshold ...
> > > > > > >
> > > > > > > Maybe what we can do is create our own method, which will be
> > > > > > > called by StandardTokenizer after yyreset is called, something
> > > > > > > like trimBufferIfTooBig(int threshold), which will reallocate
> > > > > > > zzBuffer if it exceeds the threshold. We can decide on a
> > > > > > > reasonable 64K threshold or something, or simply always cut
> > > > > > > back to 16 KB. As far as I understand, that buffer should never
> > > > > > > grow that much. I.e., in zzRefill, which is the only place where
> > > > > > > the buffer gets resized, there is an attempt to first move back
> > > > > > > characters that were already consumed, and only then allocate a
> > > > > > > bigger buffer. Which means the buffer only gets expanded if
> > > > > > > there is a token larger than 16KB (!?).
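> > > > > > >
> > > > > > > Something like this (untested sketch):
> > > > > > >
> > > > > > >     void trimBufferIfTooBig(int threshold) {
> > > > > > >         // shrink the scanner buffer back down after a huge
> > > > > > >         // token grew it; safe right after yyreset, since the
> > > > > > >         // buffer holds no unconsumed characters at that point
> > > > > > >         if (zzBuffer.length > threshold) {
> > > > > > >             zzBuffer = new char[threshold];
> > > > > > >         }
> > > > > > >     }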
> > > > > > >
> > > > > > > A trimBuffer method might not be that bad ... as a protective
> > > > > > > measure. What do you think? Of course, JFlex can fix it on
> > > > > > > their own ... but until that happens ...
> > > > > > >
> > > > > > > Shai
> > > > > > >
> > > > > > > On Thu, Apr 8, 2010 at 10:35 AM, Uwe Schindler <uwe@thetaphi.de> wrote:
> > > > > > >
> > > > > > > > > I would also like to identify the problematic document. I
> > > > > > > > > have 10000, so what would be the best way of identifying
> > > > > > > > > the one that is making zzBuffer grow without control?
> > > > > > > >
> > > > > > > > Don't index your documents; instead, pass them directly to
> > > > > > > > the analyzer and consume the TokenStream manually. Then visit
> > > > > > > > TermAttribute.termLength() for each token.
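> > > > > > > >
> > > > > > > > For example (rough sketch against the Lucene 3.0 API; the
> > > > > > > > field name "content" is just a placeholder):
> > > > > > > >
> > > > > > > >     TokenStream ts = analyzer.tokenStream("content", reader);
> > > > > > > >     TermAttribute term = ts.addAttribute(TermAttribute.class);
> > > > > > > >     int maxLen = 0;
> > > > > > > >     while (ts.incrementToken()) {
> > > > > > > >         // track the longest token; a value near zzBuffer's
> > > > > > > >         // size points at the document that made it grow
> > > > > > > >         maxLen = Math.max(maxLen, term.termLength());
> > > > > > > >     }
> > > > > > > >     ts.close();
> > > > > > > >     System.out.println("longest token: " + maxLen + " chars");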
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > /Rubén
> > > > > >
> > > >
> > > >
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > /Rubén
>
>
>
>


-- 
/Rubén
