lucene-java-user mailing list archives

From Herbert L Roitblat <>
Subject Re: java.lang.OutOfMemoryError: GC overhead limit exceeded
Date Sun, 11 Apr 2010 17:28:40 GMT
Hi, Folks.  Thanks, Ruben, for your help.  It let me get a ways down the road.

The problem is that the heap is filling up when I am doing a 
lucene.TermQuery.  What I am trying to accomplish is to get the terms in 
one field of each document and their frequency in the document.  A code 
snippet is attached below. It yields the results I want.

I managed to get a small enough heap dump into jhat.  Now I could use 
some help understanding what I have found and some help figuring out 
what to do about it.  I am a newbie at understanding the details of 
Lucene, pyLucene, and Java debugging.
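
For reference, this is roughly how the JVM can be started from PyLucene so that it writes a heap dump automatically when the OOM hits, ready for jhat afterwards.  This is only a sketch: the dump path is a placeholder, and the comma-separated vmargs format is what I have seen in JCC builds but may vary.

    import lucene

    # Sketch: start the PyLucene JVM with an explicit max heap and ask HotSpot to
    # write a heap dump when an OutOfMemoryError occurs, so it can be fed to jhat.
    # The dump path is a placeholder; adjust maxheap/vmargs for your environment.
    lucene.initVM(lucene.CLASSPATH,
                  maxheap='2048m',
                  vmargs='-XX:+HeapDumpOnOutOfMemoryError,-XX:HeapDumpPath=/tmp/pylucene.hprof')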

If I understand correctly, the heap is filling up because it is keeping 
instances of objects around after there is no longer any need for them. 
I thought that it might be the case that Python was somehow keeping them 
around, but that does not seem to be the case (true?).
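
One way to check that from the Python side is the JCC reference dump that also appears (commented out) in the snippet below.  A rough sketch of what I mean, with the gc.collect() call and the sorting/printing added just for readability: if the Term/TermInfo counts keep climbing here, Python wrappers are pinning them; if they stay flat, the growth is inside the JVM itself.

    import gc
    import lucene

    # Sketch: force a Python collection first, then dump the counts of Java objects
    # that the JCC layer still holds references to on behalf of Python code.
    def dump_jcc_refs(top=20):
        gc.collect()
        refs = lucene.JCCEnv._dumpRefs(classes=True)  # same call as in the commented-out lines below
        for cls, count in sorted(refs.items(), key=lambda kv: -kv[1])[:top]:
            print count, cls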

 From jhat, I got a class instance histogram:

290163 instances of class org.apache.lucene.index.TermInfo
289988 instances of class org.apache.lucene.index.Term
1976 instances of class org.apache.lucene.index.FieldInfo
1976 instances of class org.apache.lucene.index.SegmentReader$Norm
1081 instances of
1048 instances of class org.apache.lucene.index.CompoundFileReader$CSIndexInput
540 instances of class org.apache.lucene.index.TermBuffer
540 instances of class org.apache.lucene.util.UnicodeUtil$UTF16Result
540 instances of class org.apache.lucene.util.UnicodeUtil$UTF8Result

There are way too many instances of index.TermInfo and index.Term.  
So, I tracked down some instances and looked for rootset references.  
There were none.  If I understand correctly, this instance should be 
garbage collected if there are no rootset references.  True?

     Here's an example from jhat:

    Rootset references to org.apache.lucene.index.TermInfo@0x7fbf6e3f8218 (includes weak refs)

    References to org.apache.lucene.index.TermInfo@0x7fbf6e3f8218 (40 bytes)
There is at least one reference to the object (it is an element in an 
array), but the array does not have rootset references either.

Am I misinterpreting these results?  In any case, what can I do about 
getting rid of these?  Is it a bug in this version of Lucene?  Is there 
a known fix?  I think that I should be able to do an unlimited number of 
queries without filling up the heap.
I am using pyLucene version 2.4.

Thanks for your help.


Code snippet:
        tFields = {}   # per-field map of term -> frequency for this document
        reader = self.index.getReader()
        lReader = reader.get()
        searcher = self.index.getSearcher()
        lSearcher = searcher.get()
        query = lucene.TermQuery(lucene.Term(OTDocument.UID_FIELD_ID, uid))
        hits = list(lSearcher.search(query))  # this line was truncated in the mail; presumably a search on the uid query
        if hits:
            hit = lucene.Hit.cast_(hits[0])
            tfvs = lReader.getTermFreqVectors(hit.getId())  # also truncated; presumably the hit's document id

            if tfvs is not None: # this happens if the vector is not stored
                for tfv in tfvs: # there's one for each field that has a stored term vector
                    tfvP = lucene.TermFreqVector.cast_(tfv)
                    if returnAllFields or tfvP.field in termFields: # add only the requested fields
                        tFields[tfvP.field] = dict([(t, f) for (t, f) in
                            zip(tfvP.getTerms(), tfvP.getTermFrequencies()) if f >= minFreq])
        else:
            # This shouldn't happen, but we just log the error and march on
            self.log.error("Unable to fetch doc %s from index" % (uid))
##        if self.opCount % 1000 == 0:
##            print lucene.JCCEnv._dumpRefs(classes=True).items()
##        self.opCount += 1

        retFields = copy.deepcopy(tFields) # return a copy of tFields to free up references to it and its contents
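
One thing I wonder about (just a sketch, since I am guessing at what getReader()/getSearcher() do internally): would it help to open the reader and searcher once, reuse them for every document, and close them explicitly when the whole pass is done, instead of fetching them inside each call?

        # Sketch only: assumes getReader()/getSearcher() hand back new Lucene objects
        # on every call, which I cannot verify from the snippet.  Opening once,
        # reusing, and closing explicitly would keep each lookup from leaving another
        # reader (and its in-memory term index) pinned in the heap.
        lReader = self.index.getReader().get()
        lSearcher = self.index.getSearcher().get()
        try:
            for uid in uids:  # uids: whatever drives the per-document loop (hypothetical name)
                self.getDocumentTerms(lReader, lSearcher, uid)  # hypothetical refactor of the snippet above
        finally:
            lSearcher.close()
            lReader.close()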

Herbert Roitblat wrote:
> Hi, folks.
> I am using PyLucene and doing a lot of get-tokens calls.  It reports version 2.4.0.  It is rpath linux with 8GB of memory.  Python is 2.4.
> I'm not sure what the maxheap is, I think that it is maxheap='2048m'.  I think that it's running in a 64 bit environment.
> It indexes a set of 116,000 documents just fine.
> Then I need to get the tokens from these documents and near the end, I run into:
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> If I wait a bit and ask again for the same document's tokens, I can get them, but it then is somewhat likely to post the same error on a certain number of other documents.  I can handle these errors and ask again.
> I have read that this error message means that the heap is getting filled up and garbage collection removes only a small amount of it.  Since all I am doing is retrieving, why should the heap be filling up?  I restarted the system before starting the retrieval.
> My guess is that there is some small memory leak, because memory assigned to my python program grows slowly as I request more document tokens.  Since I'm not intending to change anything in either my python program or in Lucene, any growth is unintentional.  I'm just getting tokens.
> We use lucene.TermQuery as the query object to get the terms.
> I cannot share the documents nor the application code, but I might be able to provide
> One last piece of information: document retrieval slows down throughout the process.  In the beginning I was getting about 10 documents per second.  Towards the end, it is down to about 5, with about 5-second pauses from time to time, perhaps due to garbage collection.
> Any idea of why the heap is filling up and what I can do about it?
> Thanks,
> Herb
