lucene-java-user mailing list archives

From "Herbert Roitblat" <h...@orcatec.com>
Subject Re: java.lang.OutOfMemoryError: GC overhead limit exceeded
Date Mon, 12 Apr 2010 14:54:53 GMT
Update:
Reusing the reader and searcher made almost no difference.  It still eats up
the heap.
----- Original Message ----- 
From: "Herbert L Roitblat" <herb@orcatec.com>
To: <java-user@lucene.apache.org>
Sent: Monday, April 12, 2010 6:50 AM
Subject: Re: java.lang.OutOfMemoryError: GC overhead limit exceeded


> Thank you Michael.  Your suggestions are helpful.  I inherited all of
> the code that uses pyLucene and don't consider myself an expert on it,
> so I very much appreciate your suggestions.
>
> It does not seem to be the case that these elements represent the index
> of the collection. TermInfo and Term grow as I retrieve more documents.
> There was no trouble building the index.
>
> The contents of these fields are the tokens (some fields are tokenized,
> others not) of the document fields.  In the tokenized fields, there is
> one object for each word. They seem to be in order of the documents for
> which the term vectors are being sought.  So these objects seem to
> represent a "concatenation" of all of the documents being considered in
> order, and if they are never removed, would always overwhelm the heap
> with a large document set.  They are not the index in the usual sense, I
> think.  Before I start retrieving documents, there is barely anything in
> these objects.
>
> What is holding the document contents in the heap after the fields
> information is returned?
>
> Can you say more about incRef/decRef?  I deleted all variables that
> interacted with Lucene, and it seems to have made no difference.
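>
> From the javadocs I gather the pattern is roughly this (just my sketch of
> what I think you mean, not our actual code -- please correct me if I have
> it wrong):
>
>     reader.incRef()      # pin the reader while this request uses it
>     try:
>         searcher = lucene.IndexSearcher(reader)
>         hits = list(searcher.search(query))
>         # ... getTermFreqVectors, etc. ...
>     finally:
>         reader.decRef()  # the reader closes once its refCount drops to zero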
>
> There are not a lot of different fields; I would say on the order of 50,
> with about 20 of them in virtually every document.
>
> It uses:
> lucene.IndexReader.open(self._store)
>
>
> One suggestion I got is to put the reader code in the class init
> function and then reuse it.  I have not tried that one yet (next on the
> agenda).  You suggested something similar, and I will try that.
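>
> What I have in mind is roughly this (untested sketch; the class name is
> made up, but self._store and the calls are the ones we already use):
>
>     class _IndexAccess(object):                                # hypothetical wrapper
>         def __init__(self, store):
>             self._store = store
>             self._reader = lucene.IndexReader.open(self._store)   # open once
>             self._searcher = lucene.IndexSearcher(self._reader)   # shares the reader
>
>         def term_freq_vectors(self, uid):
>             query = lucene.TermQuery(lucene.Term(OTDocument.UID_FIELD_ID, uid))
>             hits = list(self._searcher.search(query))
>             if hits:
>                 hit = lucene.Hit.cast_(hits[0])
>                 return self._reader.getTermFreqVectors(hit.id)
>             return None
>
>         def close(self):
>             self._searcher.close()
>             self._reader.close()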
>
> Thanks,
> Herb
>
>
> Michael McCandless wrote:
>> The large count of TermInfo & Term is completely normal -- this is
>> Lucene's term index, which is entirely RAM resident.
>>
>> In 3.1, with flexible indexing, the RAM efficiency of the terms index
>> should be much improved.
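>>
>> (One knob worth checking, if it is available in your 2.4 build -- I would
>> have to double-check when it went in -- is the terms index divisor:
>> loading only every Nth entry of the terms index cuts that RAM, at the
>> cost of slower term lookups.  Roughly:
>>
>>     lReader = lucene.IndexReader.open(store)
>>     lReader.setTermInfosIndexDivisor(4)   # keep ~1/4 of the terms index in RAM;
>>                                           # set it before the first search
>>
>> The index-time analogue is IndexWriter.setTermIndexInterval, but that
>> requires re-indexing.)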
>>
>> While opening a new reader/searcher for every query is horribly
>> inefficient, it should not leak memory.  (Are you using
>> IndexReader.reopen?  I see calls to getReader, but this Lucene API
>> (near-real-time search) wasn't added until 2.9, and you're on 2.4, so
>> I think that's your own method?).
>>
>> What do your get/getReader/getSearcher calls do?  Are you using
>> incRef/decRef at all to manage the lifetime of your readers?  How many
>> unique field names do you have, across all docs that you index?
>>
>> If you change your test to open a single reader, but run that
>> TermQuery over and over and over again, do you still hit OOME?
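>>
>> Ie, something along these lines (rough sketch -- adapt it to your wrapper
>> code):
>>
>>     lReader = lucene.IndexReader.open(store)    # one reader for the whole run
>>     lSearcher = lucene.IndexSearcher(lReader)
>>     for uid in uids:                            # the same per-document loop as now
>>         query = lucene.TermQuery(lucene.Term(OTDocument.UID_FIELD_ID, uid))
>>         hits = list(lSearcher.search(query))
>>         # ... getTermFreqVectors etc., but no open()/close() per document ...
>>     lSearcher.close()
>>     lReader.close()
>>
>> If the heap stays flat, the problem is in how the readers are being
>> opened/closed; if it still grows, something else is holding on.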
>>
>> Mike
>>
>> On Sun, Apr 11, 2010 at 1:28 PM, Herbert L Roitblat <herb@orcatec.com> 
>> wrote:
>>
>>> Hi, Folks.  Thanks, Ruben, for your help.  It let me get a ways down the
>>> road.
>>>
>>> The problem is that the heap is filling up when I am doing a
>>> lucene.TermQuery.  What I am trying to accomplish is to get the terms in
>>> one field of each document and their frequency in the document.  A code
>>> snippet is attached below.  It yields the results I want.
>>>
>>> I managed to get a small enough heap dump into jhat.  Now I could use
>>> some help understanding what I have found, and some help figuring out
>>> what to do about it.  I am a newbie at understanding the details of
>>> Lucene, pyLucene, and Java debugging.
>>>
>>> If I understand correctly, the heap is filling up because it is keeping
>>> instances of objects around after there is no longer any need for them.
>>> I thought that it might be the case that Python was somehow keeping them
>>> around, but that does not seem to be the case (true?).
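>>>
>>> (To try to rule out the Python side, I periodically dumped the references
>>> the JCC layer holds, per the pylucene-dev thread linked in the snippet
>>> below -- roughly:
>>>
>>>     if self.opCount % 1000 == 0:
>>>         print lucene.JCCEnv._dumpRefs(classes=True).items()
>>>     self.opCount += 1
>>>
>>> though I am not certain I am reading that output correctly.)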
>>>
>>> From jhat, I got a class instance histogram:
>>>
>>> 290163 instances of class org.apache.lucene.index.TermInfo
>>> 289988 instances of class org.apache.lucene.index.Term
>>>   1976 instances of class org.apache.lucene.index.FieldInfo
>>>   1976 instances of class org.apache.lucene.index.SegmentReader$Norm
>>>   1081 instances of class org.apache.lucene.store.FSDirectory$FSIndexInput
>>>   1048 instances of class org.apache.lucene.index.CompoundFileReader$CSIndexInput
>>>    540 instances of class org.apache.lucene.index.TermBuffer
>>>    540 instances of class org.apache.lucene.util.UnicodeUtil$UTF16Result
>>>    540 instances of class org.apache.lucene.util.UnicodeUtil$UTF8Result
>>> ...
>>>
>>> There are way too many instances of index.TermInfo and index.Term.  So,
>>> I tracked down some instances and looked for rootset references.  There
>>> were none.  If I understand correctly, such an instance should be garbage
>>> collected if there are no rootset references.  True?
>>>
>>> Here's an example from jhat (the page listed no rootset references for
>>> this object):
>>>
>>>     Rootset references to org.apache.lucene.index.TermInfo@0x7fbf6e3f8218
>>>     (includes weak refs)
>>>
>>>     References to org.apache.lucene.index.TermInfo@0x7fbf6e3f8218 (40 bytes)
>>> ---
>>> There is at least one reference to the object (it is an element in an
>>> array), but the array does not have rootset references either.
>>>
>>> Am I misinterpreting these results?  In any case, what can I do about
>>> getting rid of these?  Is it a bug in this version of Lucene?  Is there
>>> a known fix?  I think that I should be able to do an unlimited number of
>>> queries without filling up the heap.
>>> I am using pyLucene version 2.4.
>>>
>>> Thanks for your help.
>>>
>>> Herb
>>>
>>> -------------------------------
>>> Code snippet:
>>>       reader = self.index.getReader()
>>>       lReader = reader.get()
>>>       searcher = self.index.getSearcher()
>>>       lSearcher = searcher.get()
>>>       query = lucene.TermQuery(lucene.Term(OTDocument.UID_FIELD_ID, uid))
>>>       hits = list(lSearcher.search(query))
>>>       if hits:
>>>           hit = lucene.Hit.cast_(hits[0])
>>>           tfvs = lReader.getTermFreqVectors(hit.id)
>>>
>>>           if tfvs is not None:  # this happens if the vector is not stored
>>>               for tfv in tfvs:  # There's one for each field that has a TermFreqVector
>>>                   tfvP = lucene.TermFreqVector.cast_(tfv)
>>>                   if returnAllFields or tfvP.field in termFields:  # add only asked fields
>>>                       tFields[tfvP.field] = dict([(t, f) for (t, f) in
>>>                           zip(tfvP.getTerms(), tfvP.getTermFrequencies()) if f >= minFreq])
>>>       else:
>>>           # This shouldn't happen, but we just log the error and march on
>>>           self.log.error("Unable to fetch doc %s from index" % (uid))
>>> ##        if self.opCount % 1000 == 0:
>>> ##            print lucene.JCCEnv._dumpRefs(classes=True).items()
>>> # http://lists.osafoundation.org/pipermail/pylucene-dev/2008-January/002171.html
>>> ##        self.opCount += 1
>>>
>>>       lReader.close()
>>>       lSearcher.close()
>>>       retFields = copy.deepcopy(tFields)  # return a copy of tFields to free
>>>                                           # up references to it and its contents
>>>
>>>
>>>
>>> Herbert Roitblat wrote:
>>>
>>>> Hi, folks.
>>>> I am using PyLucene and doing a lot of token retrieval.  lucene.py
>>>> reports version 2.4.0.  It is rPath Linux with 8GB of memory.  Python is
>>>> 2.4.  I'm not sure what the maxheap is; I think it is maxheap='2048m'.
>>>> I think that it's running in a 64-bit environment.
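>>>>
>>>> (As I understand it, that limit comes from the maxheap argument passed
>>>> when the JVM is started from Python -- roughly the following, though I
>>>> am going from memory and our exact call may differ:
>>>>
>>>>     import lucene
>>>>     lucene.initVM(lucene.CLASSPATH, maxheap='2048m')   # sets the JVM -Xmx
>>>>
>>>> so raising that value is one knob I can try.)
>>>>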
>>>> It indexes a set of 116,000 documents just fine.
>>>> Then I need to get the tokens from these documents, and near the end I
>>>> run into:
>>>>
>>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>
>>>> If I wait a bit and ask again for the same document's tokens, I can get
>>>> them, but it is then somewhat likely to post the same error on a certain
>>>> number of other documents.  I can handle these errors and ask again.
>>>>
>>>> I have read that this error message means that the heap is getting
>>>> filled up and garbage collection removes only a small amount of it.
>>>> Since all I am doing is retrieving, why should the heap be filling up?
>>>> I restarted the system before starting the retrieval.
>>>>
>>>> My guess is that there is some small memory leak, because the memory
>>>> assigned to my Python program grows slowly as I request more document
>>>> tokens.  Since I'm not intending to change anything in either my Python
>>>> program or in Lucene, any growth is unintentional.  I'm just getting
>>>> tokens.
>>>>
>>>> We use lucene.TermQuery as the query object to get the terms.
>>>>
>>>> I cannot share the documents or the application code, but I might be
>>>> able to provide snippets.
>>>>
>>>> One last piece of information: the time needed to retrieve documents
>>>> slows throughout the process.  In the beginning I was getting about 10
>>>> documents per second.  Towards the end, it is down to about 5, with
>>>> about 5-second pauses from time to time, perhaps due to garbage
>>>> collection?
>>>>
>>>> Any idea of why the heap is filling up and what I can do about it?
>>>>
>>>> Thanks,
>>>> Herb
>>>>
>>>>
>>>>
>>>>
>>>>
>>
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

