lucene-dev mailing list archives

From <karl.wri...@nokia.com>
Subject RE: Lucene 4.0 memory usage during indexing - is this expected?
Date Wed, 03 Oct 2012 16:40:34 GMT
Threads are managed via an executor service backed by a fixed-size thread pool, of size 16 on
this machine.

There are not a lot of fields in the schema (a half dozen).  We do use PerFieldAnalyzerWrapper.

I'm still grappling with the MAT reports; it's possible, of course, that we're holding onto
something unexpected, or even that we have a fragmentation situation.  Stay tuned.

Karl

-----Original Message-----
From: ext Michael McCandless [mailto:lucene@mikemccandless.com] 
Sent: Wednesday, October 03, 2012 11:50 AM
To: dev@lucene.apache.org
Subject: Re: Lucene 4.0 memory usage during indexing - is this expected?

I wish I could remember/find the Jira issue here ... there was one fairly recently.

Are you really sure you're not turning over threads that are coming through Lucene...?  High
thread turnover causes challenges for ThreadLocals ...
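
A minimal JDK-only sketch of the concern above, not Lucene code: it counts how many per-thread state objects a ThreadLocal creates when work runs on a fixed pool versus on a fresh thread per task. The class name, method names, and the plain Object standing in for expensive per-thread analysis state are all hypothetical, chosen only for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadLocalTurnoverDemo {

    // Runs `tasks` jobs on a fixed pool and returns how many per-thread
    // state objects the ThreadLocal had to create (at most `poolSize`).
    static int statesWithFixedPool(int poolSize, int tasks) throws InterruptedException {
        AtomicInteger created = new AtomicInteger();
        // Hypothetical stand-in for expensive per-thread state.
        ThreadLocal<Object> state = ThreadLocal.withInitial(() -> {
            created.incrementAndGet();
            return new Object();
        });
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> state.get());
        }
        pool.shutdown();
        if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
            throw new IllegalStateException("pool did not terminate in time");
        }
        return created.get();
    }

    // Runs each job on a brand-new thread; every thread triggers a fresh
    // initial value, so the count equals the number of tasks.
    static int statesWithThreadPerTask(int tasks) throws InterruptedException {
        AtomicInteger created = new AtomicInteger();
        ThreadLocal<Object> state = ThreadLocal.withInitial(() -> {
            created.incrementAndGet();
            return new Object();
        });
        for (int i = 0; i < tasks; i++) {
            Thread t = new Thread(() -> state.get());
            t.start();
            t.join();
        }
        return created.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("fixed pool:      " + statesWithFixedPool(16, 200));
        System.out.println("thread per task: " + statesWithThreadPerTask(200));
    }
}
```

With a fixed pool the number of per-thread states is bounded by the pool size, whereas with thread turnover it grows with the number of threads ever started, and how quickly the old states are reclaimed depends on the garbage collector.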

Do you have a lot of fields?  Are you using PerFieldAnalyzerWrapper...?

Mike McCandless

http://blog.mikemccandless.com

On Wed, Oct 3, 2012 at 10:45 AM,  <karl.wright@nokia.com> wrote:
> There's a fixed-size thread pool involved in doing the indexing, of a size that depends
> on the machine parameters.
> Karl
>
> -----Original Message-----
> From: ext Michael McCandless [mailto:lucene@mikemccandless.com]
> Sent: Wednesday, October 03, 2012 10:43 AM
> To: Wright Karl (Nokia-LC/Boston)
> Subject: Re: Lucene 4.0 memory usage during indexing - is this expected?
>
> This is no good!
>
> Can you send an email to dev@?  This sounds very familiar ... and I had thought we committed
> a fix for it ... hopefully Uwe or Robert can remember what it was!
>
> Do you create new threads frequently, to do indexing?  Rather than pulling from a fixed
> pool?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Wed, Oct 3, 2012 at 8:32 AM,  <karl.wright@nokia.com> wrote:
>> Hi Mike,
>>
>>
>>
>> I've got a technical question for you.
>>
>>
>>
>> For background, we've been building a new address search engine on 
>> top of Lucene 4.0.  The main customization involves a chain of custom 
>> analyzers etc, and it all works quite well.  Or at least it did until 
>> I added 7m more documents to the list.  At that point the indexing 
>> process began to run out of memory, even though we were giving it 
>> some 20GB.  Only some 12GB of that is accounted for in our part of the world.
>>
>>
>>
>> Looking at an Eclipse MAT dump, the main thing that still seems to
>> grow over time is the set of TokenStreamComponent objects that are being
>> held indirectly by org.apache.lucene.index.FieldInvertState objects.  
>> The number of FieldInvertState objects grows and grows.  By the 
>> middle of the indexing process, there are 30 of these, and each one 
>> of these seems to hold onto one TokenStreamComponent per field.  
>> (Each TokenStreamComponent in turn holds onto a whole pile of things 
>> like ICU tokenizers etc, so there's a strong multiplicative factor 
>> involved, which in the end winds up holding about 10GB of memory for 
>> those 30 objects.)
>>
>>
>>
>> The question: Why does the number of FieldInvertState objects grow 
>> over time during indexing?  Are these associated in some way with 
>> segments?  Is this expected behavior?
>>
>>
>>
>> Thanks!
>>
>> Karl
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For 
> additional commands, e-mail: dev-help@lucene.apache.org
>
