lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: IndexWriter memory leak?
Date Thu, 08 Apr 2010 07:33:49 GMT
Hi Ruben,

as Shai already pointed out, the large buffer is held by StandardTokenizer,
which is used by StandardAnalyzer. This code is outside Lucene's control,
as it is generated by the JFlex library.

As long as the IndexWriter instance is alive, the buffer is held implicitly
by the Analyzer. If you want the buffer to shrink, file a bug against JFlex.
For reuse, StandardAnalyzer holds the Tokenizer instance as long as it
lives, and since you pass the Analyzer to the IndexWriter, it will only be
GC'ed when the IndexWriter is GC'ed.
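For illustration, here is a minimal sketch of that lifecycle (3.0-style
API, untested here; the Directory setup and the actual indexing calls are
placeholders):

  import java.io.IOException;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.util.Version;

  void indexAll(Directory dir) throws IOException {
    // A fresh analyzer per writer: the reused StandardTokenizer (and
    // its grown zzBuffer) lives exactly as long as this pair.
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
    IndexWriter writer =
        new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.LIMITED);
    try {
      // ... addDocument() / updateDocument() calls go here ...
    } finally {
      writer.close();
    }
    // When this method returns, both local references go away; once
    // nothing else references the analyzer, its tokenizer and the
    // zzBuffer become unreachable and can be GC'ed.
  }

(See also the P.S. below the quoted thread for one way to identify the
offending token and document.)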

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Ruben Laguna [mailto:ruben.laguna@gmail.com]
> Sent: Thursday, April 08, 2010 9:24 AM
> To: java-user@lucene.apache.org
> Subject: Re: IndexWriter memory leak?
> 
> I'm using StandardAnalyzer.
> 
> I indeed parse large documents, XML and PDF, using NekoHTML and Tika
> respectively.
> 
> I took a look at the zzBuffer contents (by exporting the value to a
> file with Eclipse MAT from the heap dump) and it seems to contain
> normal text from several documents. See below:
> 
> cat heapdumps/zzBuffer/value | iconv -f UTF-16 -t UTF-8 | head
> 
> On the NetBeans Platform Build Harness
> «/on-the-netbeans-platform-build-harness/atform Build Harness
> 
>    Abstract
>   The NetBeans build harness performs a number of tasks when building a
> NetBeans platform application. This article describes some of the
> issues
> that can be encountered when building platform applications in other
> ways
> than the standard use case of creating and building it on a single PC
> in the
> NetBeans IDE. It also provides some tips for customising build
> behaviour.
> 
> 
> but it's difficult to tell whether there is also a big token in the
> buffer, because I have no easy way to visualize it (iconv skips some
> characters, and hexdump -C doesn't really help to see the contents).
> 
> Do you know of any way of actually identifying the offending token, or
> of listing the tokens in the zzBuffer in a human-readable way?
> 
> I would also like to identify the problematic document. I have 10000
> documents, so what would be the best way of identifying the one that
> is making zzBuffer grow without control?
> 
> Best regards/Ruben
> 
> On Thu, Apr 8, 2010 at 8:41 AM, Shai Erera <serera@gmail.com> wrote:
> 
> > What Analyzer are you using? zzBuffer belongs to the tokenizer's
> > automaton that is generated by JFlex. I've checked StandardTokenizerImpl
> > and zzBuffer can grow beyond the default 16KB, but yours looks to be a
> > lot bigger (33 MB!?). The only explanation I have for this is that you
> > are trying to (or at some point tried to) parse a very long file whose
> > text couldn't be broken down into small tokens (maybe due to a lack of
> > spaces or something) ...
> >
> > It'd be good if you told us which Analyzer you use, as well as whether
> > you can identify a problematic document in your collection that could
> > cause the buffer to grow that much.
> >
> > Shai
> >
> > On Thu, Apr 8, 2010 at 1:23 AM, Ruben Laguna <ruben.laguna@gmail.com>
> > wrote:
> >
> > > I want to add that I tried this in both 2.9.0 and 3.0.1 and I got
> > > the same "leaky" behavior.
> > >
> > > See [3] for a screenshot of a 67MB zzBuffer on Lucene 3.0.1.
> > >
> > > I managed to get rid of the Reader "memory leak" by setting the
> > > reference to the underlying Tika Reader to null in my wrapper when
> > > the wrapper is closed. But I still think it would be nicer if
> > > IndexWriter didn't keep references to the Readers after indexing.
> > >
> > > [3] http://img.skitch.com/20100407-ntn2kg13fx49wx4q118bp9h1hb.jpg
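> > >
> > > In case it helps others, the wrapper is roughly like this (a
> > > simplified sketch, not the exact code; EofOnErrorReader is just a
> > > placeholder name):
> > >
> > >   import java.io.IOException;
> > >   import java.io.Reader;
> > >
> > >   public final class EofOnErrorReader extends Reader {
> > >     private Reader delegate; // the Tika Reader; nulled on close
> > >
> > >     public EofOnErrorReader(Reader delegate) {
> > >       this.delegate = delegate;
> > >     }
> > >
> > >     @Override
> > >     public int read(char[] cbuf, int off, int len) throws IOException {
> > >       if (delegate == null) return -1; // already closed
> > >       try {
> > >         return delegate.read(cbuf, off, len);
> > >       } catch (IOException e) {
> > >         return -1; // map the parser error to EOF
> > >       }
> > >     }
> > >
> > >     @Override
> > >     public void close() throws IOException {
> > >       if (delegate != null) {
> > >         try {
> > >           delegate.close();
> > >         } finally {
> > >           delegate = null; // release the Tika Reader for GC
> > >         }
> > >       }
> > >     }
> > >   }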
> > >
> > >
> > > On Wed, Apr 7, 2010 at 10:35 PM, Ruben Laguna
> > > <ruben.laguna@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > It seems like my IndexWriter, after committing and optimizing, has
> > > > a retained size of 140MB. See [1] for a screenshot of the heap dump
> > > > analysis done with Eclipse MAT.
> > > >
> > > > Of those 140MB, 67MB are retained by
> > > > analyzer.tokenStreams.hardRefs.table.HashMap$Entry.value.tokenStream.scanner.zzBuffer
> > > >
> > > >
> > > > Why is this? Is it a memory leak, or did I do something wrong
> > > > during the indexing? (BTW, I'm indexing documents that contain
> > > > Field(xxxx, Reader) instances, and those Readers are wrappers
> > > > around Tika.parse(xxxx) Readers. I get a lot of IOExceptions from
> > > > the Tika readers, and the wrapper maps the exceptions to EOF so
> > > > Lucene doesn't see the exception.)
> > > >
> > > >
> > > >
> > > > ...and 73MB of the 140MB are retained by docWriter, see [2]. It
> > > > looks like the Field objects in the array
> > > > docWriter.threadStates[0].consumer.fieldHash[1].fields[xxxx] are
> > > > holding references to the Readers. Those Reader instances are
> > > > actually closed after IndexWriter.updateDocument. Each one of
> > > > those Readers retains 1MB. The question is why IndexWriter holds
> > > > references to those Readers after the documents have been indexed.
> > > >
> > > >
> > > > [1] http://img.skitch.com/20100407-1183815yiausisg73u9wfgscsj.jpg
> > > > [2] http://img.skitch.com/20100407-b86irkp7e4uif2wq1dd4t899qb.jpg
> > > >
> > > > --
> > > > /Rubén
> > > >
> > >
> > >
> > >
> > > --
> > > /Rubén
> > >
> >
> 
> 
> 
> --
> /Rubén
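P.S.: Regarding your question about identifying the offending token and
document: one option is to re-run your analyzer over each document's text
yourself and record the longest token it produces; the document with a
pathologically large maximum is the culprit. A small, untested sketch
against the 2.9/3.0 TokenStream API (maxTokenLength is just a hypothetical
helper name):

  import java.io.IOException;
  import java.io.Reader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.TermAttribute;

  // Returns the length of the longest token the analyzer produces for
  // the given content; call it once per document and log the result.
  static int maxTokenLength(Analyzer analyzer, String field, Reader content)
      throws IOException {
    TokenStream ts = analyzer.tokenStream(field, content);
    TermAttribute term = ts.addAttribute(TermAttribute.class);
    int max = 0;
    ts.reset();
    while (ts.incrementToken()) {
      max = Math.max(max, term.termLength());
    }
    ts.end();
    ts.close();
    return max;
  }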


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

