incubator-clerezza-dev mailing list archives

From Andy Seaborne <andy.seabo...@epimorphics.com>
Subject Re: leak but where after parsing rdf files?
Date Fri, 21 Jan 2011 11:37:35 GMT


On 20/01/11 19:43, Reto Bachmann-Gmuer wrote:
> Hi Andy
>
> I've committed an application that uses Jena directly, without Clerezza stuff
> in the middle, that demonstrates the problem.
>
> Starting it with
>
> MAVEN_OPTS="-Xmx256m -Xms128m"  mvn clean install exec:java -o -e
>
> it will fail at one of the files, however if I change the order in which the
> files are to be parsed and put the file it was failing at at the beginning,
> it succeeds parsing this file and will fail at another one.
>
> the app is here:
> http://svn.apache.org/viewvc/incubator/clerezza/issues/CLEREZZA-384/turtlememory

Not entirely without clerezza stuff - the POM does not work standalone.

After some POM hacking, I got it working.  I take it the test is 
"TestWithFiles".

It's not using RIOT because that's not in the Jena download yet.

Add

     <dependency>
       <groupId>com.hp.hpl.jena</groupId>
       <artifactId>arq</artifactId>
       <version>2.8.7</version>
     </dependency>

and either:

	com.hp.hpl.jena.query.ARQ.init() ;

or

	org.openjena.riot.SysRIOT.wireIntoJena() ;

With this the test passes (and much faster as well).
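To make the wiring concrete, here's a minimal sketch of the second option (the class name and file name are placeholders; it assumes ARQ 2.8.7 is on the classpath as per the dependency above):

```java
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;

public class ParseWithRiot {
    public static void main(String[] args) {
        // Register RIOT's parsers in place of the built-in JavaCC ones.
        org.openjena.riot.SysRIOT.wireIntoJena();

        // Model.read now dispatches the Turtle parsing to RIOT.
        Model model = ModelFactory.createDefaultModel();
        model.read("file:data.ttl", "TTL");
        System.out.println(model.size() + " triples loaded");
    }
}
```

Calling ARQ.init() instead has the same effect, since ARQ's initialisation wires RIOT into Jena for you.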


The test is not just parsing.  It's storing the results in a model, so 
the space needed includes complete storage of the model.

With only a small increase in -Xmx (e.g. 350m), the test passes.
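For instance, taking the invocation from your mail and raising only the heap ceiling:

```shell
MAVEN_OPTS="-Xmx350m -Xms128m" mvn clean install exec:java -o -e
```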

The test fails in the first pass over the files if it's going to fail. I 
suspect that one or more internal systems have fixed-size caches.  Jena 
does.  JavaCC has expanding buffering (and you have some very large 
literals).

Jena's caches are bounded by number of slots, so churn driven by large 
literals needs to settle down before any conclusion about a memory 
leak can be drawn.  Hence failing on the first pass is not suggestive of 
a memory leak.  This is backed up by the fact that file order matters.

The old parser is built on JavaCC, which uses expanding buffers; your 
long literals force those to grow, so the runtime working space is 
higher for a single-file parse.  RIOT uses a fixed-size buffer and builds 
the large literals directly into the string used as the RDF node.

Since increasing the heap makes the test run, and the test fails in 
the first pass over the files if it is going to fail at all, I conclude 
it's various caches filling up and just not fitting. I guess it passes 
at 256m with RIOT by chance: slightly less overhead means the caches 
just happen to fit.

There is a streaming interface to RIOT in org.openjena.riot.RiotReader.
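For a parse-only test that never builds a model, the streaming interface avoids the storage cost entirely. A rough sketch (the exact RiotReader/Sink signatures are from memory of the 2.8.7-era API; check the ARQ javadoc before relying on them, and the file name is a placeholder):

```java
import org.openjena.atlas.lib.Sink;
import org.openjena.riot.RiotReader;
import com.hp.hpl.jena.graph.Triple;

public class StreamCount {
    public static void main(String[] args) {
        // A Sink that counts triples without materialising a model,
        // so working space stays flat regardless of file size.
        final long[] count = { 0 };
        Sink<Triple> counter = new Sink<Triple>() {
            public void send(Triple t) { count[0]++; }
            public void flush() {}
            public void close() {}
        };
        RiotReader.parseTriples("data.ttl", counter);
        System.out.println(count[0] + " triples");
    }
}
```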

	Andy
