clerezza-dev mailing list archives

From Andy Seaborne <>
Subject Re: Is Clerezza leaking memory?
Date Fri, 29 Nov 2013 17:15:27 GMT
On 29/11/13 12:31, Minto van der Sluis wrote:
> Andy Seaborne wrote on 29-11-2013 9:39:
>> On 28/11/13 13:17, Minto van der Sluis wrote:
>>> Hi,
>>> I just ran into some peculiar behavior.
>>> For my current project I have to import 633 files, each containing
>>> approx. 20 MB of XML data (a total of 13 GB). When importing this
>>> data into a single graph I hit an out of memory exception on the 7th
>>> file.
>>> Looking at the heap I noticed that after restarting the application 
>>> I could load a few more files. So I started looking for the bundle 
>>> that consumed all the memory. It happened to be the Clerezza TDB 
>>> Storage provider. See the following image (GC = garbage collection):
>>> Looking more closely I noticed that Apache Jena is able to close a
>>> graph (graph.close()), but Clerezza is not using this feature and
>>> keeps the graph open all the time.
>> Jena graphs backed by TDB are simply views of the dataset - they 
>> don't have any state associated with them directly.  If the reference 
>> becomes inaccessible, GC should clean up.
> Hi Andy,
> The problem, as far as I can tell, is not in Jena TDB itself. The Jena 
> TDB bundle is still active/running. Only the Clerezza TDB Provider 
> bundle is stopped (by me). As my image shows, a normal GC does not
> release all of the memory. Only after stopping the Clerezza TDB
> Provider is the memory allocated for importing released. Stopping this
> particular bundle makes all Jena data structures inaccessible and
> eligible for GC, just like the image shows.
> My reasoning is that, since the Clerezza TDB Provider has a map with
> weak references to Jena models, these references are never properly
> garbage collected. Since I use the same graph all the time, all data
> accumulates, resulting in an out of memory error. Looking at a memory
> dump, most space is occupied by byte arrays containing the imported data.
> I use a nasty hack to prevent this dreaded out of memory error. After
> every import I restart the Clerezza TDB Provider bundle programmatically
> (hail OSGi, for I wouldn't know how to do this without it). This way I
> have been able to import more than 300 files in a row (still running).
> Regards,
> Minto
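
For reference, a bundle restart of the kind Minto describes needs only a
couple of plain OSGi framework calls.  A minimal sketch, not the provider's
actual code; the symbolic name passed in is a placeholder:

    import org.osgi.framework.Bundle;
    import org.osgi.framework.BundleContext;
    import org.osgi.framework.BundleException;

    public class BundleRestarter {
        // Stop and start the bundle with the given symbolic name, e.g. the
        // Clerezza TDB provider.  Stopping it releases the provider's
        // references, so the data it held becomes eligible for GC.
        public static void restart(BundleContext ctx, String symbolicName)
                throws BundleException {
            for (Bundle b : ctx.getBundles()) {
                if (symbolicName.equals(b.getSymbolicName())) {
                    b.stop();
                    b.start();
                    return;
                }
            }
        }
    }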

It does look like something in Clerezza is holding memory.  Do note that 
TDB has internal caches, so it will grow as well.  Datasets are kept around 
because they are expensive to re-warm, and the node table cache is 
in-heap.  Other caches are not in-heap (64-bit mode).
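
If the provider needs to let go of a database between imports, TDB can be
asked to drop a dataset and its caches explicitly.  A rough sketch, assuming
TDBFactory.release() is available in the Jena version in use; the TDB
location is a placeholder:

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.tdb.TDB;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class ReleaseAfterImport {
        public static void main(String[] args) {
            Dataset ds = TDBFactory.createDataset("/path/to/tdb");
            // ... run one import into ds here ...
            TDB.sync(ds);            // flush pending changes to disk
            TDBFactory.release(ds);  // drop the dataset, including its in-heap
                                     // node table cache, from TDB's internal cache
        }
    }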

If you want to bulk import, you could load the TDB database directly, 
using the bulk loader.  Indeed, it can be worthwhile to take the input, 
create an N-Quads file with lots of checking and validation of the 
data, and then load the N-Quads.  It's annoying to get part way through a 
large load and find the data isn't perfect.
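
A rough sketch of that approach, assuming the input files are RDF/XML (or
another syntax Jena can parse); package names are the Jena 2.x ones of the
time, and the graph URI, file names and database location are placeholders:

    import java.io.FileOutputStream;
    import java.io.OutputStream;

    import org.apache.jena.riot.Lang;
    import org.apache.jena.riot.RDFDataMgr;

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.query.DatasetFactory;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;

    public class PrepareNQuads {
        public static void main(String[] args) throws Exception {
            String graphUri = "http://example.org/import";
            try (OutputStream out = new FileOutputStream("all-data.nq")) {
                for (String file : args) {
                    // Parsing with RIOT checks the syntax; a bad file fails
                    // here, before anything touches the database.
                    Model m = ModelFactory.createDefaultModel();
                    RDFDataMgr.read(m, file);

                    // Wrap the model in a dataset so it can be written as
                    // N-Quads under the target graph name.
                    Dataset tmp = DatasetFactory.createMem();
                    tmp.addNamedModel(graphUri, m);
                    RDFDataMgr.write(out, tmp, Lang.NQUADS);
                }
            }
            // Then load the single file with the TDB bulk loader, e.g.
            //   tdbloader --loc /path/to/tdb all-data.nq
        }
    }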

