incubator-clerezza-dev mailing list archives

From Alessandro Adamou <ada...@cs.unibo.it>
Subject Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files
Date Wed, 04 Apr 2012 17:18:41 GMT
Hi Rupert, all,

just letting you know that I have tried the SingleTdbDatasetTcProvider in
the field on one of my use cases, which involves many small ontologies
(content design patterns).

I've created ~20 graphs totalling about 500 triples.

On OS X 10.6.8 (on an HFS+ filesystem with journalling) the database grew
from an initial 184MiB to 248MiB.

I have yet to test large graphs, so I cannot tell whether the overhead
comes from the named graph indexes or the triple storage, but this is
already a big leap forward from the TdbTcProvider.
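
For a rough sense of scale, the figures in this thread can be turned into a back-of-the-envelope comparison. The sketch below is only an illustration, not a measurement: the class and method names are made up, the 193M per-dataset figure is taken from the directory listing quoted further down, and it assumes that the initial 184MiB is a fixed cost and that the "M" in that listing is close enough to MiB for a loose comparison.

```java
import java.util.Locale;

public class TdbOverheadEstimate {

    /** Average growth per graph (MiB) when one shared dataset grows from beforeMiB to afterMiB. */
    static double sharedGrowthPerGraphMiB(double beforeMiB, double afterMiB, int graphCount) {
        if (graphCount <= 0) {
            throw new IllegalArgumentException("graphCount must be positive");
        }
        return (afterMiB - beforeMiB) / graphCount;
    }

    /** Total footprint (MiB) when every graph gets its own dataset with a fixed index cost. */
    static double separateDatasetsTotalMiB(double perDatasetMiB, int graphCount) {
        return perDatasetMiB * graphCount;
    }

    public static void main(String[] args) {
        int graphs = 20; // "~20 graphs" from the message above (approximate)

        // Single shared dataset: 184MiB before, 248MiB after loading the graphs.
        double perGraph = sharedGrowthPerGraphMiB(184, 248, graphs);

        // One dataset per graph: 193M each, as in the quoted directory listing (assumed ~MiB).
        double separateTotal = separateDatasetsTotalMiB(193, graphs);

        System.out.printf(Locale.ROOT, "shared dataset: about %.1f MiB growth per graph%n", perGraph);
        System.out.printf(Locale.ROOT, "separate datasets: about %.0f MiB in total%n", separateTotal);
    }
}
```

With only ~20 small graphs this says nothing about how the overhead scales, which is why the large-graph test is still needed.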

Have you already committed this component to rdf.jena.tdb.storage?

Best,

Alessandro

On 3/19/12 9:16 AM, Hasan Hasan wrote:
> Hi all,
>
> I generally agree with extending Clerezza so that it can support multiple
> requirements. Thus, I see the need for the SingleDatasetTdbTcProvider,
> although I am a bit unhappy that application developers have to be aware
> of it.
> Note that new Clerezza instances (at least my own build) no longer
> generate 200 MB of index files for empty graphs, but merely 200K.
>
> Regards
> Hasan
>
>
> On Fri, Mar 16, 2012 at 2:10 PM, Rupert Westenthaler<
> rupert.westenthaler@gmail.com>  wrote:
>
>> Hi David, Stanbol & Clerezza community
>>
>> Short summary of the situation:
>>
>> The Ontonet component generates a lot of MGraphs using the Jena TDB
>> provider. This causes the disk consumption and the number of open files
>> to explode. See the quoted emails for details.
>>
>>
>> @Stanbol: we are already discussing how to avoid the creation of so many
>> graphs.
>>
>>
>> @Clerezza the observed behavior of the TDB provider is also very dangerous
>> (at least for typical use cases in Apache Stanbol).
>>
>> Even though it targets a different problem, CLEREZZA-467 [1] may provide a
>> possible solution, as it suggests using named graphs instead of isolated
>> TDB instances for creating MGraphs.
>>
>> To be honest, this would be the optimal solution for our use of Clerezza
>> in Stanbol. However, I assume that for a semantic CMS it is safer to use
>> different TDB datasets.
>>
>> Because of that I would like to make the following proposal, which
>> hopefully covers the needs of both Apache Stanbol and Apache Clerezza.
>>
>> 1. AbstractTdbTcProvider: providing most of the functionality needed to
>> store Clerezza MGraphs in Jena TDB
>>
>> 2. TdbTcProvider: The same as now, but extending the abstract one. It
>> follows the currently used methodology of mapping Clerezza graphs to
>> separate TDB datasets
>>
>> 3. SingleDatasetTdbTcProvider: TDB provider variant that stores all
>> MGraphs in a single TDB dataset. This provider should also support
>> "configurationFactory=true" (multiple instances). Each instance would use a
>> different TDB dataset to store its MGraphs.
>>
>> By default the SingleDatasetTdbTcProvider would be inactive, because it
>> requires a configuration of the directory for the TDB dataset as well as a
>> name (that can be used in Filters). This ensures full backward
>> compatibility.
>>
>> In environments such as Stanbol, where you want to store multiple graphs
>> in the same TDB dataset, you would need to provide a configuration for the
>> SingleDatasetTdbTcProvider. Here you have two possible usage scenarios:
>>
>> * if you just need a single TDB dataset that stores all MGraphs, then you
>> can assign a high enough service.ranking to the SingleDatasetTdbTcProvider
>> and use the TcManager as usual to create your graphs.
>> * if you want to use single TDB datasets or a mix of the TdbTcProvider and
>> SingleDatasetTdbTcProvider instances, you will need to use appropriate
>> filters.
>>
>>
>> WDYT
>> Rupert
>>
>>
>> [1] https://issues.apache.org/jira/browse/CLEREZZA-467
>>
>> On 16.03.2012, at 10:44, Rupert Westenthaler wrote:
>>
>>> Hi David, all
>>>
>>> this could be the explanation for the failed build on the Jenkins server
>>> when the SEO configuration for the Refactor engine was used in the default
>>> configuration of the Full launcher;
>>> see http://markmail.org/message/sprwklaobdjankig for details.
>>>
>>> To me it looks as if the RefactorEngine creates multiple Jena TDB
>>> instances for the various MGraphs it creates. One needs to know that even
>>> for an empty graph Jena TDB creates ~200MByte of index files. So it is
>>> important to map multiple MGraphs to different named graphs of the same
>>> Jena TDB store.
>>>
>>> I have no idea how Clerezza manages this or how Ontonet creates MGraphs,
>>> but I hope this can help in tracing this down.
>>> best
>>> Rupert
>>>
>>> On 16.03.2012, at 10:30, David Riccitelli wrote:
>>>
>>>> Dears,
>>>>
>>>> As I ran into disk issues, I found that this folder:
>>>> sling/felix/bundleXXX/data/tdb-data/mgraph
>>>>
>>>> where XXX is the bundle of:
>>>> Clerezza - SCB Jena TDB Storage Provider
>>>> org.apache.clerezza.rdf.jena.tdb.storage
>>>>
>>>> took almost 70 GB of disk space (at which point the disk space was
>>>> exhausted).
>>>>
>>>> These are some of the files I found inside:
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>>>>
>>>>
>>>> Any clues?
>>>>
>>>> Thanks,
>>>> David Riccitelli
>>>>
>>>>
>> ********************************************************************************
>>>> InsideOut10 s.r.l.
>>>> P.IVA: IT-11381771002
>>>> Fax: +39 0110708239
>>>> ---
>>>> LinkedIn: http://it.linkedin.com/in/riccitelli
>>>> Twitter: ziodave
>>>> ---
>>>> Layar Partner Network<
>> http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1
>> ********************************************************************************
>>


-- 
M.Sc. Alessandro Adamou

Alma Mater Studiorum - Università di Bologna
Department of Computer Science
Mura Anteo Zamboni 7, 40127 Bologna - Italy

Semantic Technology Laboratory (STLab)
Institute for Cognitive Science and Technology (ISTC)
National Research Council (CNR)
Via Nomentana 56, 00161 Rome - Italy


"I will give you everything, so long as you do not demand anything."
(Ettore Petrolini, 1930)

Not sent from my iSnobTechDevice

