lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shai Erera <ser...@gmail.com>
Subject Re: IndexWriter memory leak?
Date Thu, 08 Apr 2010 10:11:00 GMT
Guess we were replying at the same time :).

On Thu, Apr 8, 2010 at 1:04 PM, Uwe Schindler <uwe@thetaphi.de> wrote:

> I already answered, that I will take care of this!
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
> > -----Original Message-----
> > From: Shai Erera [mailto:serera@gmail.com]
> > Sent: Thursday, April 08, 2010 12:00 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: IndexWriter memory leak?
> >
> > Yes, that's the trimBuffer version I was thinking about, only this guy
> > created a reset(Reader, int) and does both ops (resetting + trim) in
> > one
> > method call. More convenient. Can you please open an issue to track
> > that?
> > People will have a chance to comment on whether we (Lucene) should
> > handle
> > that, or it should be a JFlex fix. Based on the number of replies this
> > guy
> > received (0 !), I doubt JFlex would consider it a problem. But we can
> > do
> > some small service to our users base by protecting against such
> > problems.
> >
> > And while you're opening the issue, if you want to take a stab at
> > fixing it
> > and post a patch, it'd be great :).
> >
> > Shai
> >
> > On Thu, Apr 8, 2010 at 12:51 PM, Ruben Laguna
> > <ruben.laguna@gmail.com>wrote:
> >
> > > I was investigating this a little further and in the JFlex mailing
> > list I
> > > found [1]
> > >
> > > I don't know much about flex / JFlex but it seems that this guy
> > resets the
> > > zzBuffer to 16384 or less when setting the input for the lexer
> > >
> > >
> > > Quoted from  shef <shef31@ya...>
> > >
> > >
> > > I set
> > >
> > > %buffer 0
> > >
> > > in the options section, and then added this method to the lexer:
> > >
> > >    /**
> > >     * Set the input for the lexer. The size parameter really speeds
> > things
> > > up,
> > >     * because by default, the lexer allocates an internal buffer of
> > 16k.
> > > For
> > >     * most strings, this is unnecessarily large. If the size param is
> > > 0 or greater
> > >     * than 16k, then the buffer is set to 16k. If the size param is
> > > smaller, then
> > >     * the buf will be set to the exact size.
> > >     * @param r the reader that provides the data
> > >     * @param the size of the data in the reader.
> > >     */
> > >    public void reset(Reader r, int size) {
> > >        if (size == 0 || size > 16384)
> > >            size = 16384;
> > >        zzBuffer = new char[size];
> > >        yyreset(r);
> > >    }
> > >
> > >
> > > So maybe there is a way to trim the zzBuffer this way (?).
> > >
> > > BTW, I will try to find out which is the "big token" in my dataset
> > this
> > > afternoon. Thanks for the help.
> > >
> > > I actually workaround this memory problem for the time being by
> > wrapping
> > > the
> > > IndexWriter in a class that periodically closes the IndexWriter and
> > creates
> > > a new one, allowing the old to be GCed, but I would be really good if
> > > either
> > > JFlex or Lucene can take care of this zzBuffer going berserk.
> > >
> > >
> > > Again thanks for the quick response. /Rubén
> > >
> > >
> > > [1]
> > >
> > >
> > https://sourceforge.net/mailarchive/message.php?msg_id=444070.38422.qm@
> > web38901.mail.mud.yahoo.com
> > >
> > > On Thu, Apr 8, 2010 at 11:32 AM, Shai Erera <serera@gmail.com> wrote:
> > >
> > > > If we could change the Flex file so that yyreset(Reader) would
> > check the
> > > > size of zzBuffer, we could trim it when it gets too big. But I
> > don't
> > > think
> > > > we have such control when writing the flex syntax ... yyreset is
> > > generated
> > > > by JFlex and that's the only place I can think of to trim the
> > buffer down
> > > > when it exceeds a predefined threshold ....
> > > >
> > > > Maybe what we can do is create our own method which will be called
> > by
> > > > StandardTokenizer after yyreset is called, something like
> > > > trimBufferIfTooBig(int threshold) which will reallocate zzBuffer if
> > it
> > > > exceeded the threshold. We can decide on a reasonable 64K threshold
> > or
> > > > something, or simply always cut back to 16 KB. As far as I
> > understand,
> > > that
> > > > buffer should never grow that much. I.e. in zzRefill, which is the
> > only
> > > > place where the buffer gets resized, there is an attempt to first
> > move
> > > back
> > > > characters that were already consumed and only then allocate a
> > bigger
> > > > buffer. Which means only if there is a token whose size is larger
> > than
> > > 16KB
> > > > (!?), will this buffer get expanded.
> > > >
> > > > A trimBuffer method might not be that bad .. as a protective
> > measure.
> > > What
> > > > do you think? Of course, JFlex can fix it on their own ... but
> > until that
> > > > happens ...
> > > >
> > > > Shai
> > > >
> > > > On Thu, Apr 8, 2010 at 10:35 AM, Uwe Schindler <uwe@thetaphi.de>
> > wrote:
> > > >
> > > > > > I would like to identify also the problematic document I have
> > 10000
> > > so,
> > > > > > what
> > > > > > would be the best way of identifying the one that it making
> > zzBuffer
> > > to
> > > > > > grow
> > > > > > without control?
> > > > >
> > > > > Dont index your documents, but instead pass them directly to the
> > > analyzer
> > > > > and consume the tokenstream manually. Then visit
> > > > TermAttribute.termLength()
> > > > > for each Token.
> > > > >
> > > > >
> > > > > -----------------------------------------------------------------
> > ----
> > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > /Rubén
> > >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message