lucene-java-user mailing list archives

From: Shai Erera <ser...@gmail.com>
Subject: Re: IndexWriter memory leak?
Date: Thu, 08 Apr 2010 06:41:23 GMT
What Analyzer are you using? zzBuffer belongs to the tokenizer's scanner,
which is generated by JFlex. I've checked StandardTokenizerImpl, and zzBuffer
can grow beyond the default 16KB, but yours looks to be a lot bigger (33 MB!?).
The only explanation I have for this is that you're trying to (or at some
point have tried to) parse a very long file whose text couldn't be broken
down into small tokens (maybe due to a lack of spaces or something) ...
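
If you want to reproduce this, something along these lines should force the
scanner's buffer to grow: feed the analyzer one huge "word" with no
whitespace, so the scanner never finds a token boundary. This is only a
rough sketch against the 3.0-era API; StandardAnalyzer, the field name, and
the input size are just assumptions for illustration:

    import java.io.StringReader;
    import java.util.Arrays;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    public class ZzBufferGrowth {
      public static void main(String[] args) throws Exception {
        // One huge token with no whitespace, so the JFlex scanner keeps
        // buffering while it searches for a token boundary. Run with a
        // large heap, e.g. -Xmx256m.
        char[] chars = new char[8 * 1024 * 1024];
        Arrays.fill(chars, 'a');

        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        TokenStream ts =
            analyzer.tokenStream("f", new StringReader(new String(chars)));
        while (ts.incrementToken()) {
          // drain the stream; the over-long token itself may be skipped
        }
        ts.close();
        // Take a heap dump here: the tokenizer's zzBuffer will have grown
        // far beyond its initial 16KB to hold the unbroken input.
      }
    }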

It'd be good if you could tell us which Analyzer you use, and whether you can
identify a problematic document in your collection that causes the buffer to
grow that much.

Shai

On Thu, Apr 8, 2010 at 1:23 AM, Ruben Laguna <ruben.laguna@gmail.com> wrote:

> I want to add that I tried this on both 2.9.0 and 3.0.1 and got the same
> "leaky" behavior.
>
> See [3] for a screenshot of a 67 MB zzBuffer on Lucene 3.0.1.
>
> I managed to get rid of the Reader "memory leak" by setting the reference
> to the underlying Tika Reader in my wrapper to null when the wrapper is
> closed. But I still think it would be nicer if IndexWriter didn't maintain
> references to the Readers after indexing.
>
> [3] http://img.skitch.com/20100407-ntn2kg13fx49wx4q118bp9h1hb.jpg
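>
> FWIW, the fixed close() in my wrapper now looks roughly like this (a
> simplified sketch; "delegate" is the field in the wrapper, quoted in my
> earlier mail below, that holds the Reader returned by Tika):
>
>     @Override
>     public void close() throws IOException {
>       try {
>         if (delegate != null) {
>           delegate.close();
>         }
>       } finally {
>         // Drop the reference so the large underlying Reader can be GC'ed
>         // even while IndexWriter still holds on to the wrapper.
>         delegate = null;
>       }
>     }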
>
>
> On Wed, Apr 7, 2010 at 10:35 PM, Ruben Laguna <ruben.laguna@gmail.com> wrote:
>
> > Hi,
> >
> > It seems like my IndexWriter, after committing and optimizing, has a
> > retained size of 140 MB. See [1] for a screenshot of the heap dump
> > analysis done with Eclipse MAT.
> >
> > Of those 140 MB, 67 MB are retained by
> > analyzer.tokenStreams.hardRefs.table.HashMap$Entry.value.tokenStream.scanner.zzBuffer
> >
> > Why is this? Is it a memory leak, or did I do something wrong during the
> > indexing? (BTW, I'm indexing documents that contain Field(xxxx, Reader)
> > instances, and those Readers are wrappers around Tika.parse(xxxx) Readers.
> > I get a lot of IOExceptions from the Tika Readers, and the wrapper maps
> > the exceptions to EOF so Lucene doesn't see them.)
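> >
> > The wrapper is roughly like this (a simplified sketch with made-up names;
> > the real class does a bit more):
> >
> >     import java.io.IOException;
> >     import java.io.Reader;
> >
> >     /** Wraps a Reader from Tika and maps its IOExceptions to EOF. */
> >     public class TikaReaderWrapper extends Reader {
> >       private Reader delegate; // the Reader returned by Tika.parse(xxxx)
> >
> >       public TikaReaderWrapper(Reader delegate) {
> >         this.delegate = delegate;
> >       }
> >
> >       @Override
> >       public int read(char[] cbuf, int off, int len) {
> >         try {
> >           return delegate.read(cbuf, off, len);
> >         } catch (IOException e) {
> >           return -1; // pretend we hit EOF so Lucene never sees the exception
> >         }
> >       }
> >
> >       @Override
> >       public void close() throws IOException {
> >         delegate.close();
> >       }
> >     }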
> >
> >
> >
> > ... and 73 MB of the 140 MB are retained by docWriter, see [2]. It looks
> > like the Field objects in the array
> > docWriter.threadStates[0].consumer.fieldHash[1].fields[xxxx]
> > are holding references to the Readers. Those Reader instances are actually
> > closed after IndexWriter.updateDocument. Each one of those Readers retains
> > 1 MB. The question is why IndexWriter holds references to those Readers
> > after the Documents have been indexed.
> >
> >
> > [1] http://img.skitch.com/20100407-1183815yiausisg73u9wfgscsj.jpg
> > [2] http://img.skitch.com/20100407-b86irkp7e4uif2wq1dd4t899qb.jpg
> >
> > --
> > /Rubén
> >
>
>
>
> --
> /Rubén
>
