lucene-java-user mailing list archives

From Ruben Laguna <ruben.lag...@gmail.com>
Subject Re: IndexWriter memory leak?
Date Thu, 08 Apr 2010 07:23:36 GMT
I'm using StandardAnalyzer.

I indeed parse large documents, xml and pdfs, using nekohtml and tika
respectively.

I took a look at the zzBuffer contents (by exporting the value to a file with
Eclipse MAT from the heap dump) and it seems to contain normal text from
several documents. See below:

cat heapdumps/zzBuffer/value|iconv -f UTF-16 -t UTF-8|head

On the NetBeans Platform Build Harness
«/on-the-netbeans-platform-build-harness/atform Build Harness

   Abstract
  The NetBeans build harness performs a number of tasks when building a
NetBeans platform application. This article describes some of the issues
that can be encountered when building platform applications in other ways
than the standard use case of creating and building it on a single PC in the
NetBeans IDE. It also provides some tips for customising build behaviour.


but it's hard to tell whether there is also a single big token in the file,
because I have no easy way to visualize it (iconv is skipping things, and
hexdump -C doesn't really help for seeing the contents).

Do you know of any way of actually identifying the offending token, or of
listing the tokens in the zzBuffer in a human-readable way?
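
(A rough sketch of what I have in mind, in case it helps. It assumes the
exported value file is the raw UTF-16 contents of the char[], as the iconv
command above suggests, and that the "UTF-16" charset can decode it; the
path is just my local export:)

  import java.io.FileInputStream;
  import java.io.InputStreamReader;
  import java.io.Reader;

  public class LongestRun {
      public static void main(String[] args) throws Exception {
          // Decode the MAT export as UTF-16 (same assumption iconv made above)
          Reader in = new InputStreamReader(
                  new FileInputStream("heapdumps/zzBuffer/value"), "UTF-16");
          int c, run = 0, maxRun = 0;
          long pos = 0, maxEnd = 0;
          while ((c = in.read()) != -1) {
              if (Character.isWhitespace(c)) {
                  run = 0;                  // token boundary: reset the run
              } else if (++run > maxRun) {
                  maxRun = run;             // new longest whitespace-free run
                  maxEnd = pos;             // remember where it ends
              }
              pos++;
          }
          in.close();
          System.out.println("longest run: " + maxRun
                  + " chars, ending at char offset " + maxEnd);
      }
  }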

I would also like to identify the problematic document. I have 10,000 of
them, so what would be the best way of finding the one that is making
zzBuffer grow without control?
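
(One approach that occurs to me, sketched below and untested: wrap each
document's Reader in a pass-through that tracks the longest whitespace-free
run it delivers, and check it after each updateDocument call. The class name
is made up:)

  import java.io.IOException;
  import java.io.Reader;

  public class TokenLengthProbe extends Reader {
      private final Reader in;
      private int run = 0;       // current whitespace-free run
      private int maxRun = 0;    // longest run seen so far

      public TokenLengthProbe(Reader in) { this.in = in; }

      public int getMaxRunLength() { return maxRun; }

      @Override
      public int read(char[] cbuf, int off, int len) throws IOException {
          int n = in.read(cbuf, off, len);
          for (int i = 0; i < n; i++) {
              if (Character.isWhitespace(cbuf[off + i])) {
                  run = 0;                // token boundary: reset the run
              } else if (++run > maxRun) {
                  maxRun = run;           // track the longest run delivered
              }
          }
          return n;
      }

      @Override
      public void close() throws IOException { in.close(); }
  }

Any document whose probe reports a run of, say, more than 100,000 chars after
indexing would be a candidate for what is inflating zzBuffer.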

Best regards,
/Ruben

On Thu, Apr 8, 2010 at 8:41 AM, Shai Erera <serera@gmail.com> wrote:

> What Analyzer are you using? zzBuffer belongs to the tokenizer's automaton
> that is generated by JFlex. I've checked StandardTokenizerImpl, and zzBuffer
> can grow beyond the default 16KB, but yours looks to be a lot bigger (33 MB
> !?). The only explanation I have for this is that you're trying to (or at
> some point have tried to) parse a very long file whose text couldn't be
> broken down into small tokens (maybe due to a lack of spaces or something)
> ...
>
> It'd be good if you could tell us the Analyzer you use, as well as whether
> you can identify a problematic document in your collection that could cause
> the buffer to grow that much.
>
> Shai
>
> On Thu, Apr 8, 2010 at 1:23 AM, Ruben Laguna <ruben.laguna@gmail.com>
> wrote:
>
> > I want to add that I tried this in both 2.9.0 and 3.0.1 and I got the
> > same "leaky" behavior.
> >
> > See [3] for a screenshot of zzBuffer of 67MB on Lucene 3.0.1.
> >
> > I managed to get rid of the Reader's "memory leak" by setting the pointer
> > to the actual Tika Reader to null in my wrapper when the wrapper is
> > closed. But I still think it would be nicer if IndexWriter didn't keep
> > references to the Readers after indexing.
> >
> > [3] http://img.skitch.com/20100407-ntn2kg13fx49wx4q118bp9h1hb.jpg
> >
> >
> > On Wed, Apr 7, 2010 at 10:35 PM, Ruben Laguna <ruben.laguna@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > It seems like my IndexWriter, after committing and optimizing, has a
> > > retained size of 140 MB. See [1] for a screenshot of the heap-dump
> > > analysis done with Eclipse MAT.
> > >
> > > Of those 140 MB, 67 MB are retained by
> > > analyzer.tokenStreams.hardRefs.table.HashMap$Entry.value.tokenStream.scanner.zzBuffer
> > >
> > >
> > > Why is this? Is it a memory leak, or did I do something wrong during
> > > the indexing? (BTW, I'm indexing documents which contain
> > > Field(xxxx, Reader) entries, and those Readers are wrappers around
> > > Tika.parse(xxxx) Readers. I get a lot of IOExceptions from the Tika
> > > readers, and the wrapper maps the exceptions to EOF so Lucene doesn't
> > > see the exception.)
> > >
> > >
> > >
> > > ...and 73 MB of the 140 MB are retained by docWriter; see [2]. It looks
> > > like the Field objects in the array
> > > docWriter.threadStates[0].consumer.fieldHash[1].fields[xxxx] are
> > > holding references to the Readers. Those Reader instances are actually
> > > closed after IndexWriter.updateDocument, and each one of them retains
> > > 1 MB. The question is why IndexWriter holds references to those Readers
> > > after the Documents have been indexed.
> > >
> > >
> > > [1] http://img.skitch.com/20100407-1183815yiausisg73u9wfgscsj.jpg
> > > [2] http://img.skitch.com/20100407-b86irkp7e4uif2wq1dd4t899qb.jpg
> > >
> > > --
> > > /Rubén
> > >
> >
> >
> >
> > --
> > /Rubén
> >
>
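
(For reference, the wrapper fix described above looks roughly like this; a
minimal sketch with a hypothetical class name. The key points are mapping the
Tika IOExceptions to EOF and dropping the delegate reference in close(), so
the long-lived docWriter can't keep the parser's buffers reachable through
the wrapper:)

  import java.io.IOException;
  import java.io.Reader;

  public class SafeTikaReader extends Reader {
      private Reader delegate; // the real Tika reader; nulled on close()

      public SafeTikaReader(Reader delegate) {
          this.delegate = delegate;
      }

      @Override
      public int read(char[] cbuf, int off, int len) throws IOException {
          if (delegate == null) {
              return -1; // already closed: report EOF
          }
          try {
              return delegate.read(cbuf, off, len);
          } catch (IOException e) {
              return -1; // map parser errors to EOF so Lucene never sees them
          }
      }

      @Override
      public void close() {
          if (delegate != null) {
              try {
                  delegate.close();
              } catch (IOException ignored) {
                  // swallow close errors too
              } finally {
                  delegate = null; // release the heavy Tika reader for GC
              }
          }
      }
  }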



-- 
/Rubén
