lucene-java-user mailing list archives

From "Daniel Taurat" <daniel.tau...@gaussvip.com>
Subject Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Date Mon, 13 Sep 2004 13:19:47 GMT
Okay, the reference test is done:
on JDK 1.4.2, Lucene 1.4.1 really does seem to run fine: just a moderate 
number of SegmentTermEnums, kept under control by gc (about 500 for the 
1900 test objects).
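
For anyone who is stuck on a pre-1.4.2 JVM: below is a minimal sketch of the idea behind Doug's patch (quoted further down), assuming the leak comes from a per-thread cached value that outlives its owner. Instead of relying on finalize(), it clears the thread-local slot explicitly. `CachedReader`, `FakeEnum` and `close()` are hypothetical names for illustration only, not Lucene classes or methods, and the counter only tracks explicit releases, not what the garbage collector does. The sketch uses generics, so it needs Java 5 or later; on an older JDK the type parameters would have to be dropped and the values cast.

```java
class ThreadLocalCleanupSketch {
    // Number of FakeEnum instances created but not yet explicitly released.
    // This is a plain counter; it does not observe garbage collection.
    static int unreleased = 0;

    /** Hypothetical stand-in for a per-thread enumerator such as SegmentTermEnum. */
    static final class FakeEnum {
        FakeEnum() { unreleased++; }
        void release() { unreleased--; }
    }

    /** Hypothetical stand-in for a reader that caches one enumerator per thread. */
    static final class CachedReader {
        private final ThreadLocal<FakeEnum> enumerators = new ThreadLocal<FakeEnum>();

        FakeEnum getEnum() {
            FakeEnum e = enumerators.get();
            if (e == null) {
                e = new FakeEnum();
                enumerators.set(e);
            }
            return e;
        }

        /** Explicit cleanup: clears the slot of the calling thread only. */
        void close() {
            FakeEnum e = enumerators.get();
            if (e != null) {
                e.release();
                enumerators.set(null);
            }
        }
    }

    /** Creates the given number of readers and reports how many enumerators were never released. */
    static int unreleasedAfter(int readers, boolean closeEach) {
        unreleased = 0;
        for (int i = 0; i < readers; i++) {
            CachedReader reader = new CachedReader();
            reader.getEnum();
            if (closeEach) {
                reader.close();
            }
        }
        return unreleased;
    }

    public static void main(String[] args) {
        System.out.println("without close(): " + unreleasedAfter(1900, false)); // prints 1900
        System.out.println("with close():    " + unreleasedAfter(1900, true));  // prints 0
    }
}
```

The explicit close() has the advantage of not depending on when (or whether) the finalizer runs, but it only helps for the thread that calls it.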


Daniel Taurat wrote:

> Hi Doug,
> you are absolutely right about the older version of the JDK: it is 
> 1.3.1 (IBM).
> Unfortunately we cannot upgrade, since we are bound to the IBM Portalserver 
> 4 environment.
> Results:
> I patched Lucene 1.4.1:
> it has not improved much: after indexing 1897 objects the number of 
> SegmentTermEnums is up to 17936.
> To be realistic: this is even a deterioration :(((
> My next check will be with JDK 1.4.2 in the test environment, but 
> this can only be a reference run for now.
>
> Thanks,
> Daniel
>
> Doug Cutting wrote:
>
>> It sounds like the ThreadLocal in TermInfosReader is not getting 
>> correctly garbage collected when the TermInfosReader is collected. 
>> Researching a bit, this was a bug in JVMs prior to 1.4.2, so my guess 
>> is that you're running in an older JVM.  Is that right?
>>
>> I've attached a patch which should fix this.  Please tell me if it 
>> works for you.
>>
>> Doug
>>
>> Daniel Taurat wrote:
>>
>>> Okay, that (1.4rc3) worked fine, too!
>>> Got only 257 SegmentTermEnums for 1900 objects.
>>>
>>> Now I will go for the final test on the production server with the 
>>> 1.4rc3 version and about 40,000 objects.
>>>
>>> Daniel
>>>
>>> Daniel Taurat wrote:
>>>
>>>> Hi all,
>>>> here is an update for you:
>>>> I switched back to Lucene 1.3-final and now the number of 
>>>> SegmentTermEnum objects is controlled by gc again:
>>>> it goes up to about 1000 and then drops back to 254 after 
>>>> indexing my 1900 test objects.
>>>> Stay tuned, I will try 1.4RC3 now, the last version before 
>>>> FieldCache was introduced...
>>>>
>>>> Daniel
>>>>
>>>>
>>>> Rupinder Singh Mazara wrote:
>>>>
>>>>> Hi all,
>>>>>  I had a similar problem. I have a database of documents with 24 
>>>>> fields and an average content size of 7K, with 16M+ records.
>>>>>
>>>>>  I had to split the job into slabs of 1M documents each and merge the 
>>>>> resulting indexes. Submissions to our job queue looked like
>>>>>
>>>>>  java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22
>>>>>
>>>>> and I still got an OutOfMemoryError. The solution I came up with 
>>>>> was to create a temp directory after every 200K documents and then 
>>>>> merge these directories together. This was done for the first 
>>>>> production run; updates are now being handled incrementally.
>>>>>
>>>>> Exception in thread "main" java.lang.OutOfMemoryError
>>>>>     at org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled Code))
>>>>>     at org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined Compiled Code))
>>>>>     at org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined Compiled Code))
>>>>>     at org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled Code))
>>>>>     at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled Code))
>>>>>     at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled Code))
>>>>>     at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled Code))
>>>>>     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled Code))
>>>>>     at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled Code))
>>>>>     at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
>>>>>     at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code))
>>>>>     at lucene.Indexer.main(CDBIndexer.java:168)
>>>>>
>>>>>  
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Daniel Taurat [mailto:daniel.taurat@gaussvip.com]
>>>>>> Sent: 10 September 2004 14:42
>>>>>> To: Lucene Users List
>>>>>> Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large
>>>>>> number of documents
>>>>>>
>>>>>>
>>>>>> Hi Pete,
>>>>>> good hint, but we actually do have 4 GB of physical memory on the
>>>>>> system. But then: we have also found that the gc of the IBM
>>>>>> JDK 1.3.1 that we use sometimes behaves strangely with too large
>>>>>> a heap space anyway. (The limit seems to be 1.2 GB.)
>>>>>> I can say that gc is not collecting these objects, since I forced
>>>>>> gc runs every now and then while indexing (when parsing pdf-type
>>>>>> objects, that is): no effect.
>>>>>>
>>>>>> regards,
>>>>>>
>>>>>> Daniel
>>>>>>
>>>>>>
>>>>>> Pete Lewis wrote:
>>>>>>
>>>>>>> Hi all
>>>>>>>
>>>>>>> Reading the thread with interest, there is another way I've come
>>>>>>> across out of memory errors when indexing large batches of documents.
>>>>>>>
>>>>>>> If you have your heap space settings too high, then you get swapping
>>>>>>> (which impacts performance) plus you never reach the trigger for garbage
>>>>>>> collection, hence you don't garbage collect and hence you run out
>>>>>>> of memory.
>>>>>>>
>>>>>>> Can you check whether or not your garbage collection is being
>>>>>>> triggered?
>>>>>>>
>>>>>>> Anomalously, therefore, if this is the case, by reducing the heap
>>>>>>> space you can improve performance and get rid of the out of memory
>>>>>>> errors.
>>>>>>>
>>>>>>> Cheers
>>>>>>> Pete Lewis
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>> From: "Daniel Taurat" <daniel.taurat@gaussvip.com>
>>>>>>> To: "Lucene Users List" <lucene-user@jakarta.apache.org>
>>>>>>> Sent: Friday, September 10, 2004 1:10 PM
>>>>>>> Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large
>>>>>>> number of documents
>>>>>>>
>>>>>>>> Daniel Aber wrote:
>>>>>>>>
>>>>>>>>> On Thursday 09 September 2004 19:47, Daniel Taurat wrote:
>>>>>>>>>
>>>>>>>>>> I am facing an out of memory problem using Lucene 1.4.1.
>>>>>>>>>
>>>>>>>>> Could you try with a recent CVS version? There has been a fix
>>>>>>>>> about files not being deleted after 1.4.1. Not sure if that could
>>>>>>>>> cause the problems you're experiencing.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Daniel
>>>>>>>>>
>>>>>>>>
>>>>>>>> Well, it seems not to be files; it looks more like those
>>>>>>>> SegmentTermEnum objects accumulating in memory.
>>>>>>>> I've seen some discussion of these objects in the 
>>>>>>>> developer newsgroup that took place some time ago.
>>>>>>>> I am afraid this is some kind of runaway caching I have to deal 
>>>>>>>> with.
>>>>>>>> Maybe not correctly addressed in this newsgroup, after all...
>>>>>>>>
>>>>>>>> Anyway: any idea if there is an API command to re-init caches?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Daniel
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>>>>>>> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>>>>>>>
>> ------------------------------------------------------------------------
>>
>> Index: src/java/org/apache/lucene/index/TermInfosReader.java
>> ===================================================================
>> RCS file: /home/cvs/jakarta-lucene/src/java/org/apache/lucene/index/TermInfosReader.java,v
>> retrieving revision 1.9
>> diff -u -r1.9 TermInfosReader.java
>> --- src/java/org/apache/lucene/index/TermInfosReader.java    6 Aug 2004 20:50:29 -0000    1.9
>> +++ src/java/org/apache/lucene/index/TermInfosReader.java    10 Sep 2004 17:46:47 -0000
>> @@ -45,6 +45,11 @@
>>     readIndex();
>>   }
>>
>> +  protected final void finalize() {
>> +    // patch for pre-1.4.2 JVMs, whose ThreadLocals leak
>> +    enumerators.set(null);
>> +  }
>> +
>>   public int getSkipInterval() {
>>     return origEnum.skipInterval;
>>   }
>>
>>  
>>
>> ------------------------------------------------------------------------
>>
>
>
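
Regarding Pete's question further up about whether garbage collection is being triggered at all: here is a rough sketch of one way to check from inside the indexing loop. It assumes a JVM that ships `java.lang.management` (Java 5 or later), so it would not run on the IBM 1.3.1 JDK discussed in this thread; there, `Runtime.totalMemory()`/`Runtime.freeMemory()` and the `-verbose:gc` launcher flag are the closest equivalents I am aware of. The class name and the allocation loop standing in for the indexing work are made up for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcCheckSketch {
    /** Sums the collection counts of all collectors; a collector may report -1 if the count is undefined. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Heap currently in use, as seen by the Runtime API. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long before = totalCollections();
        // Stand-in workload: each "batch" allocates about 10 MB of short-lived garbage.
        for (int batch = 1; batch <= 5; batch++) {
            byte[][] garbage = new byte[1000][];
            for (int i = 0; i < garbage.length; i++) {
                garbage[i] = new byte[10 * 1024];
            }
            System.out.println("batch " + batch
                    + ": used heap = " + (usedHeapBytes() / 1024) + " KB"
                    + ", collections since start = " + (totalCollections() - before));
        }
    }
}
```

If the collection count stays flat while the used heap keeps climbing from batch to batch, that would point in the direction Pete describes; if the count rises but the heap does not come down, the objects are presumably still referenced somewhere.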



