incubator-clerezza-dev mailing list archives

From Rupert Westenthaler <rupert.westentha...@gmail.com>
Subject Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files
Date Fri, 16 Mar 2012 13:10:05 GMT
Hi David, stanbol & clerezza community

Short summary of the situation:

The Ontonet component generates a large number of MGraphs using the Jena TDB provider. This
causes disk consumption and the number of open files to explode. See the quoted emails for details.


@Stanbol: we are already discussing how to avoid creating so many graphs.


@Clerezza: the observed behavior of the TDB provider is also very dangerous (at least for typical
use cases in Apache Stanbol).

Although it targets a different issue, CLEREZZA-467 [1] may provide a possible solution, as
it suggests using named graphs instead of isolated TDB instances for creating MGraphs.

To be honest, this would be the optimal solution for our usage of Clerezza in Stanbol. However,
I assume that for a semantic CMS it is safer to use different TDB datasets.

Because of that, I would like to make the following proposal, which hopefully covers the needs
of both Apache Stanbol and Apache Clerezza.

1. AbstractTdbTcProvider: provides most of the functionality needed to store Clerezza MGraphs
in Jena TDB.

2. TdbTcProvider: the same as now, but extending the abstract one. It follows the currently
used methodology of mapping Clerezza graphs to separate TDB datasets.

3. SingleDatasetTdbTcProvider: a TDB provider variant that stores all MGraphs in a single TDB
dataset. This provider should also support "configurationFactory=true" (multiple instances).
Each instance would use a different TDB dataset to store its MGraphs.
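
To make the difference between variants 2 and 3 concrete, here is a minimal sketch in plain Java. It does not use the Clerezza or Jena APIs; the class names, the URL-encoding of graph names, and the 193 MB per-dataset figure (taken from the directory listing quoted below) are assumptions for illustration only.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

class TdbLayoutSketch {

    // Assumption: size of one freshly created TDB dataset, as seen in the
    // directory listing quoted in this thread (193M per directory).
    static final long MB_PER_DATASET = 193;

    /** Shared part: counts datasets and estimates their disk footprint. */
    abstract static class AbstractLayout {
        final Path baseDir;

        AbstractLayout(Path baseDir) {
            this.baseDir = baseDir;
        }

        /** Directory of the TDB dataset that would hold the given graph. */
        abstract Path datasetDirFor(String graphName);

        long datasetCount(List<String> graphNames) {
            return graphNames.stream().map(this::datasetDirFor).distinct().count();
        }

        long estimatedMegabytes(List<String> graphNames) {
            return datasetCount(graphNames) * MB_PER_DATASET;
        }
    }

    /** Models the current behavior: one dataset directory per graph. */
    static final class PerGraphLayout extends AbstractLayout {
        PerGraphLayout(Path baseDir) {
            super(baseDir);
        }

        @Override
        Path datasetDirFor(String graphName) {
            // Assumption: the directory name is the URL-encoded graph name.
            return baseDir.resolve(URLEncoder.encode(graphName, StandardCharsets.UTF_8));
        }
    }

    /** Models the proposed variant: all graphs share one dataset directory. */
    static final class SingleDatasetLayout extends AbstractLayout {
        SingleDatasetLayout(Path baseDir) {
            super(baseDir);
        }

        @Override
        Path datasetDirFor(String graphName) {
            // Graphs would be told apart as named graphs inside the dataset.
            return baseDir;
        }
    }

    public static void main(String[] args) {
        List<String> graphs = Arrays.asList(
                "ontonet::inputstream:ontology889",
                "ontonet::inputstream:ontology1041",
                "ontonet::inputstream:ontology395");
        AbstractLayout perGraph = new PerGraphLayout(Paths.get("tdb-data", "mgraph"));
        AbstractLayout single = new SingleDatasetLayout(Paths.get("tdb-data", "single"));
        System.out.println("per-graph: " + perGraph.estimatedMegabytes(graphs) + " MB");
        System.out.println("single:    " + single.estimatedMegabytes(graphs) + " MB");
    }
}
```

With the per-graph layout the estimate grows linearly with the number of graphs, while the single-dataset layout pays the fixed index cost only once.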

By default the SingleDatasetTdbTcProvider would be inactive, because it requires a configuration
that specifies the directory for the TDB dataset as well as a name (that can be used in filters).
This ensures full backward compatibility.

In environments such as Stanbol, where you want to store multiple graphs in the same TDB
dataset, you would need to provide a configuration for the SingleDatasetTdbTcProvider. Here
you have two possible usage scenarios:

* if you just need a single TDB dataset that stores all MGraphs, then you can assign a high
enough service.ranking to the SingleDatasetTdbTcProvider and use the TcManager as usual to
create your graphs.
* if you want to use several single TDB datasets, or a mix of TdbTcProvider and
SingleDatasetTdbTcProvider instances, you will need to use corresponding filters.
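
The two scenarios can be modeled in a few lines of plain Java. This is only a sketch of the selection logic: it does not use the OSGi or Clerezza APIs, and the provider names and ranking values are hypothetical. In a real OSGi container, service.ranking is a service property and filters are LDAP-style filter strings.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

class ProviderSelectionSketch {

    /** Hypothetical stand-in for a registered provider service. */
    static final class Provider {
        final String name;
        final int ranking;

        Provider(String name, int ranking) {
            this.name = name;
            this.ranking = ranking;
        }
    }

    /** Scenario 1: no filter, the provider with the highest ranking wins. */
    static Optional<Provider> highestRanked(List<Provider> providers) {
        return providers.stream()
                .max(Comparator.comparingInt((Provider p) -> p.ranking));
    }

    /** Scenario 2: an explicit filter on the configured name picks one provider. */
    static Optional<Provider> byName(List<Provider> providers, String name) {
        return providers.stream()
                .filter(p -> p.name.equals(name))
                .findFirst();
    }

    public static void main(String[] args) {
        // Hypothetical setup: the existing provider plus two configured
        // single-dataset instances.
        List<Provider> providers = Arrays.asList(
                new Provider("default-tdb", 0),
                new Provider("stanbol-single", 100),
                new Provider("ontonet-single", 50));

        System.out.println(highestRanked(providers).get().name);
        System.out.println(byName(providers, "ontonet-single").get().name);
    }
}
```

In the first scenario callers do not need to know which provider is used. In the second, each caller names the dataset it wants, which is why the configuration requires a name.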


WDYT
Rupert


[1] https://issues.apache.org/jira/browse/CLEREZZA-467

On 16.03.2012, at 10:44, Rupert Westenthaler wrote:

> Hi David, all
> 
> this could explain the failed build on the Jenkins server when the SEO configuration
> for the Refactor engine was used in the default configuration of the Full launcher.
> 
> see http://markmail.org/message/sprwklaobdjankig for details.
> 
> To me it looks as if the RefactorEngine creates multiple Jena TDB instances for the
> various MGraphs it creates. One needs to know that even for an empty graph Jena TDB creates
> ~200 MByte of index files. So it is important to map multiple MGraphs to different named
> graphs of the same Jena TDB store.
> 
> I have no idea how Clerezza manages this or how Ontonet creates MGraphs, but I hope this
> helps in tracing it down.
> 
> best
> Rupert 
> 
> On 16.03.2012, at 10:30, David Riccitelli wrote:
> 
>> Dears,
>> 
>> As I ran into disk issues, I found that this folder:
>> sling/felix/bundleXXX/data/tdb-data/mgraph
>> 
>> where XXX is the bundle of:
>> Clerezza - SCB Jena TDB Storage Provider
>> org.apache.clerezza.rdf.jena.tdb.storage
>> 
>> took almost 70 GB of disk space (at which point the disk space was
>> exhausted).
>> 
>> These are some of the files I found inside:
>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>> 
>> 
>> Any clues?
>> 
>> Thanks,
>> David Riccitelli
>> 
>> ********************************************************************************
>> InsideOut10 s.r.l.
>> P.IVA: IT-11381771002
>> Fax: +39 0110708239
>> ---
>> LinkedIn: http://it.linkedin.com/in/riccitelli
>> Twitter: ziodave
>> ---
>> Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>> ********************************************************************************
> 

