lucene-java-user mailing list archives

From Herbert L Roitblat <h...@orcatec.com>
Subject Re: java.lang.OutOfMemoryError: GC overhead limit exceeded
Date Sun, 11 Apr 2010 17:28:40 GMT
Hi, folks.  Thanks, Ruben, for your help.  It got me a good way down the 
road.

The problem is that the heap is filling up when I am doing a 
lucene.TermQuery.  What I am trying to accomplish is to get the terms in 
one field of each document, along with their frequency in that document.  
A code snippet is attached below; it yields the results I want.
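
Stripped down, the per-field lookup I mean is roughly this (a sketch only; 
getTermFreqVector takes the document number and a field name, lReader is 
the reader from the snippet below, and "body" is just a placeholder for my 
real field name):

    # Sketch: terms and their in-document frequencies for one field of one
    # document.  Returns None if term vectors were not stored for the field.
    tfv = lReader.getTermFreqVector(docId, "body")
    if tfv is not None:
        freqs = dict(zip(tfv.getTerms(), tfv.getTermFrequencies()))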

I managed to get a small enough heap dump into jhat.  Now I could use 
some help understanding what I have found and figuring out what to do 
about it.  I am a newbie when it comes to the details of Lucene, 
PyLucene, and Java debugging.

If I understand correctly, the heap is filling up because object 
instances are being kept around after there is no longer any need for 
them.  I thought that Python might somehow be holding on to them, but 
that does not seem to be the case (is that right?).
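
To double-check the Python side, the kind of test I have in mind is 
roughly the following (just a sketch, built around the _dumpRefs call 
that is commented out in the snippet below; I am not certain this is the 
right way to read those counts):

    # Sketch: force a Python-side collection, then ask the JCC bridge which
    # Java objects it is still holding for us.  _dumpRefs(classes=True) is
    # assumed to return a mapping of class name -> live reference count.
    import gc
    import lucene  # the VM is already initialized elsewhere in the program

    gc.collect()
    refs = lucene.JCCEnv._dumpRefs(classes=True)
    for name, count in sorted(refs.items(), key=lambda item: item[1], reverse=True)[:20]:
        print name, count

If the Lucene classes do not show up there in large numbers, the wrappers 
are presumably being released on the Python side and the instances are 
being held somewhere inside the JVM.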

From jhat, I got a class instance histogram:

290163 instances of class org.apache.lucene.index.TermInfo
289988 instances of class org.apache.lucene.index.Term
1976 instances of class org.apache.lucene.index.FieldInfo
1976 instances of class org.apache.lucene.index.SegmentReader$Norm
1081 instances of class org.apache.lucene.store.FSDirectory$FSIndexInput
1048 instances of class org.apache.lucene.index.CompoundFileReader$CSIndexInput
540 instances of class org.apache.lucene.index.TermBuffer
540 instances of class org.apache.lucene.util.UnicodeUtil$UTF16Result
540 instances of class org.apache.lucene.util.UnicodeUtil$UTF8Result
...

There are way too many instances of index.TermInfo and index.Term.  So I 
tracked down some instances and looked for rootset references.  There 
were none.  If I understand correctly, an instance with no rootset 
references should be garbage collected.  True?

Here's an example from jhat:

    Rootset references to org.apache.lucene.index.TermInfo@0x7fbf6e3f8218 (includes weak refs)
        (nothing listed)

    References to org.apache.lucene.index.TermInfo@0x7fbf6e3f8218 (40 bytes)
---
There is at least one reference to the object (it is an element in an 
array), but the array does not have rootset references either.

Am I misinterpreting these results?  In any case, what can I do to get 
rid of these instances?  Is this a bug in this version of Lucene?  Is 
there a known fix?  I think I should be able to run an unlimited number 
of queries without filling up the heap.
I am using PyLucene version 2.4.

Thanks for your help.

Herb

-------------------------------
Code snippet:
        tFields = {} # per-field {term: frequency} results collected below
        reader = self.index.getReader()
        lReader = reader.get()
        searcher = self.index.getSearcher()
        lSearcher = searcher.get()
        query = lucene.TermQuery(lucene.Term(OTDocument.UID_FIELD_ID, uid))
        hits = list(lSearcher.search(query))
        if hits:
            hit = lucene.Hit.cast_(hits[0])
            tfvs = lReader.getTermFreqVectors(hit.id)

            if tfvs is not None: # tfvs is None if term vectors were not stored
                for tfv in tfvs: # one entry per field that has a TermFreqVector
                    tfvP = lucene.TermFreqVector.cast_(tfv)
                    if returnAllFields or tfvP.field in termFields: # add only the requested fields
                        tFields[tfvP.field] = dict([(t, f) for (t, f) in
                                                    zip(tfvP.getTerms(), tfvP.getTermFrequencies())
                                                    if f >= minFreq])
        else:
            # This shouldn't happen, but we just log the error and march on
            self.log.error("Unable to fetch doc %s from index"%(uid))
##        if self.opCount % 1000 == 0:
##            print lucene.JCCEnv._dumpRefs(classes=True).items()
##            # http://lists.osafoundation.org/pipermail/pylucene-dev/2008-January/002171.html
##        self.opCount += 1

        lReader.close()
        lSearcher.close()
        retFields = copy.deepcopy(tFields) # return a copy of tFields to free up references to it and its contents
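
One more thing I am wondering about, in case it matters: the method above 
gets and closes a reader and searcher on every call.  Just to sketch what 
I mean (not something I have confirmed helps), I could instead keep one 
pair open for the whole token-extraction pass:

        # Sketch only: open the reader/searcher once for the whole pass
        # instead of per document; "uids" stands in for whatever drives the
        # per-document loop.
        reader = self.index.getReader()
        lReader = reader.get()
        searcher = self.index.getSearcher()
        lSearcher = searcher.get()
        try:
            for uid in uids:
                query = lucene.TermQuery(lucene.Term(OTDocument.UID_FIELD_ID, uid))
                hits = list(lSearcher.search(query))
                if not hits:
                    continue
                hit = lucene.Hit.cast_(hits[0])
                tfvs = lReader.getTermFreqVectors(hit.id)
                # ... same per-field handling as in the snippet above ...
        finally:
            lSearcher.close()
            lReader.close()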



Herbert Roitblat wrote:
> Hi, folks.
> I am using PyLucene and doing a lot of get tokens.  lucene.py reports version 2.4.0.
> It is rPath Linux with 8GB of memory.  Python is 2.4.
> I'm not sure what the maxheap is; I think it is maxheap='2048m'.  I think that it's
> running in a 64 bit environment.
> It indexes a set of 116,000 documents just fine.
> Then I need to get the tokens from these documents, and near the end I run into:
>
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>
> If I wait a bit and ask again for the same document's tokens, I can get them, but it
> then is somewhat likely to post the same error on a certain number of other documents.  I
> can handle these errors and ask again.
>
> I have read that this error message means that the heap is getting filled up and garbage
> collection removes only a small amount of it.  Since all I am doing is retrieving, why should
> the heap be filling up?  I restarted the system before starting the retrieval.
>
> My guess is that there is some small memory leak, because memory assigned to my Python
> program grows slowly as I request more document tokens.  Since I'm not intending to change
> anything in either my Python program or in Lucene, any growth is unintentional.  I'm just
> getting tokens.
>
> We use lucene.TermQuery as the query object to get the terms.
>
> I cannot share the documents nor the application code, but I might be able to provide
> snippets.
>
> One last piece of information: the time needed to retrieve documents slows throughout
> the process.  In the beginning I was getting about 10 documents per second.  Towards the end,
> it is down to about 5, with 5-second pauses from time to time, perhaps due to garbage
> collection?
>
> Any idea of why the heap is filling up and what I can do about it?
>
> Thanks,
> Herb
