incubator-stanbol-dev mailing list archives

From Alessandro Adamou <ada...@cs.unibo.it>
Subject Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files
Date Thu, 05 Apr 2012 10:47:42 GMT
Hi Rupert, here are a few more numbers:

In the same setup, I loaded the NCI ontology from 
http://www.mindswap.org/2003/CancerOntology/ (about 400k triples, 
lightly axiomatized, DL expressivity ALE).

With the SingleTdbDatasetTcProvider, the storage directory grew by 156 MiB 
(192 -> 348).

With the TdbTcProvider, the newly created directory ended up 76 MiB above 
its initial size (192 -> 268).

Then I compressed both directories with bzip2 to see whether the data was 
partly just "filling" the initial 192 MiB:
- the SingleTdbDatasetTcProvider one shrank to ~25 MiB
- the TdbTcProvider one shrank to ~17 MiB
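
As a rough sanity check, the growth figures above can be turned into bytes per triple. This is only a sketch: the triple count is the approximate "about 400k" quoted above, and the function name growth_per_triple is made up for illustration.

```python
MIB = 1024 * 1024
TRIPLES = 400_000  # assumption: "about 400k triples", an approximate count


def growth_per_triple(before_mib, after_mib, triples=TRIPLES):
    """Return the on-disk growth in bytes per loaded triple."""
    return (after_mib - before_mib) * MIB / triples


# Directory sizes in MiB as reported above (before -> after loading).
single = growth_per_triple(192, 348)    # SingleTdbDatasetTcProvider
separate = growth_per_triple(192, 268)  # TdbTcProvider

print(f"single dataset: {single:.0f} bytes per triple")
print(f"separate datasets: {separate:.0f} bytes per triple")
print(f"ratio: {single / separate:.2f}")
```

That comes to roughly 409 versus 199 bytes per triple, so the single dataset costs about twice as much space for this ontology.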

I guess the overhead comes from storing the data as quadruples because 
of the named graphs. I noticed that the files 
(GOSP|GPOS|GSPO|OSPG|POSG|SPOG).dat, which I assume hold the quadruples, 
are each 4 times as large in the SingleTdbDatasetTcProvider database, 
whereas the triple files (OSP|POS|SPO).dat are the same size in both. 
I guess this redundancy is the price paid for fast access.
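
To illustrate the kind of redundancy I mean, here is a toy model in which every quad is stored once per index ordering. It is a sketch under my own assumptions, not how TDB actually lays out its .dat files: the ordering names are simply taken from the file names above, while build_indexes and the sample quads are made up.

```python
# Ordering names taken from the quad file names listed above.
QUAD_INDEXES = ["GSPO", "GPOS", "GOSP", "SPOG", "POSG", "OSPG"]


def build_indexes(quads, orderings=QUAD_INDEXES):
    """Store every (g, s, p, o) quad once per ordering, sorted by that ordering."""
    fields = "GSPO"  # position of each component in the input tuples
    indexes = {}
    for name in orderings:
        positions = [fields.index(c) for c in name]
        indexes[name] = sorted(tuple(q[p] for p in positions) for q in quads)
    return indexes


# Hypothetical sample data: three quads spread over two named graphs.
quads = [
    ("g1", "s1", "p1", "o1"),
    ("g1", "s2", "p1", "o2"),
    ("g2", "s1", "p2", "o1"),
]
indexes = build_indexes(quads)
total_entries = sum(len(entries) for entries in indexes.values())
print(total_entries, "index entries for", len(quads), "quads")
```

In this model each quad is written six times, once per ordering, against three times for a triple in the (OSP|POS|SPO) files. In exchange, a lookup on any leading component can use an ordering that is already sorted for it.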

My interpretation may be a bit fuzzy, though. Still, the result looks 
pretty good to me.

Best,

Alessandro


----------

On 4/4/12 7:31 PM, Rupert Westenthaler wrote:
> On 04.04.2012, at 19:18, Alessandro Adamou wrote:
>
>> Hi Rupert, all,
>>
>> just letting you know that I have tried the SingleTdbDatasetTcProvider in the field with
>> one of my use cases, which involves many small ontologies (content design patterns).
>>
>> I've created ~20 graphs totalling about 500 triples
>>
>> On OS X 10.6.8 (on HFS+ filesystem with journalling) the database grew from an initial
>> 184 MiB to 248 MiB
>>
>> I have yet to test large graphs, so I cannot tell whether the overhead comes from the named
>> graph indexes or the triple storage, but this is already a big leap from the TdbTcProvider.
>>
> Thx for testing.
>
>> Did you already commit this component to rdf.jena.tdb.storage ?
>>
> No, not yet, but I have made some improvements and fixed some bugs since the last patch
> attached to the issue. I hope I will have some time to finish this later this week.
>
> best
> Rupert
>
>> Best,
>>
>> Alessandro
>>
>> On 3/19/12 9:16 AM, Hasan Hasan wrote:
>>> Hi all,
>>>
>>> I generally agree with extending Clerezza so that it can support multiple
>>> requirements, so I see the need for the SingleDatasetTdbTcProvider.
>>> I am a bit unhappy, though, that application developers
>>> have to be aware of this.
>>> Note that new Clerezza instances (at least my own build) no longer
>>> generate 200 MB of index files for empty graphs, but merely 200K.
>>>
>>> Regards
>>> Hasan
>>>
>>>
>>> On Fri, Mar 16, 2012 at 2:10 PM, Rupert Westenthaler
>>> <rupert.westenthaler@gmail.com> wrote:
>>>
>>>> Hi David, Stanbol & Clerezza community
>>>>
>>>> Short summary of the situation:
>>>>
>>>> The Ontonet component generates a lot of MGraphs using the Jena TDB
>>>> provider. This causes the disk consumption and number of open files to
>>>> explode. See the quoted emails for details.
>>>>
>>>>
>>>> @Stanbol: we are already discussing how to avoid the creation of so many
>>>> graphs.
>>>>
>>>>
>>>> @Clerezza the observed behavior of the TDB provider is also very dangerous
>>>> (at least for typical use cases in Apache Stanbol).
>>>>
>>>> Even though it targets a different problem, CLEREZZA-467 [1] may provide a
>>>> possible solution, as it suggests using named graphs instead of isolated
>>>> TDB instances for creating MGraphs.
>>>>
>>>> To be honest, this would be the optimal solution for our usages of Clerezza
>>>> in Stanbol. However, I assume that for a semantic CMS it is safer to use
>>>> different TDB datasets.
>>>>
>>>> Because of that I  would like to make the following proposal that
>>>> hopefully covers both the needs of Apache Stanbol and Apache Clerezza.
>>>>
>>>> 1. AbstractTdbTcProvider: providing most of the functionality needed to
>>>> store Clerezza MGraphs in Jena TDB
>>>>
>>>> 2. TdbTcProvider: The same as now, but extending the abstract one. It
>>>> follows the currently used methodology to map Clerezza graphs to separate
>>>> TDB datasets
>>>>
>>>> 3. SingleDatasetTdbTcProvider: TDB provider variant that stores all
>>>> MGraphs in a single TDB dataset. This provider should also support
>>>> "configurationFactory=true" (multiple instances). Each instance would use a
>>>> different TDB dataset to store its MGraphs.
>>>>
>>>> By default the SingleDatasetTdbTcProvider would be inactive, because it
>>>> requires a configuration of the directory for the TDB dataset as well as a
>>>> name (that can be used in Filters). This ensures full backward
>>>> compatibility.
>>>>
>>>> In environments such as Stanbol, where you want to store multiple graphs
>>>> in the same TDB dataset, you would need to provide a configuration for the
>>>> SingleDatasetTdbTcProvider. Here you have two possible usage scenarios:
>>>>
>>>> * if you just need a single TDB dataset that stores all MGraphs, then you
>>>> can assign a high enough service.ranking to the SingleDatasetTdbTcProvider
>>>> and use the TcManager as usual to create your graphs.
>>>> * if you want to use single TDB datasets or a mix of TdbTcProvider and
>>>> SingleDatasetTdbTcProvider instances, you will need to use appropriate filters.
>>>>
>>>>
>>>> WDYT
>>>> Rupert
>>>>
>>>>
>>>> [1] https://issues.apache.org/jira/browse/CLEREZZA-467
>>>>
>>>> On 16.03.2012, at 10:44, Rupert Westenthaler wrote:
>>>>
>>>>> Hi David, all
>>>>>
>>>>> this could be the explanation for the failed build on the Jenkins server
>>>>> when the SEO configuration for the Refactor engine was used in the default
>>>>> configuration of the Full launcher
>>>>> see http://markmail.org/message/sprwklaobdjankig for details.
>>>>>
>>>>> To me it looks as if the RefactorEngine creates multiple
>>>>> Jena TDB instances for the various MGraphs it creates. One needs to know that even
>>>>> for an empty graph Jena TDB creates ~200 MB of index files. So it is
>>>>> important to map multiple MGraphs to different named graphs of the same
>>>>> Jena TDB store.
>>>>> I have no idea how Clerezza manages this or how Ontonet creates MGraphs,
>>>>> but I hope this can help in tracing this down.
>>>>> best
>>>>> Rupert
>>>>>
>>>>> On 16.03.2012, at 10:30, David Riccitelli wrote:
>>>>>
>>>>>> Dears,
>>>>>>
>>>>>> As I ran into disk issues, I found that this folder:
>>>>>> sling/felix/bundleXXX/data/tdb-data/mgraph
>>>>>>
>>>>>> where XXX is the bundle of:
>>>>>> Clerezza - SCB Jena TDB Storage Provider
>>>>>> org.apache.clerezza.rdf.jena.tdb.storage
>>>>>>
>>>>>> took almost 70 GB of disk space (at which point the disk space was
>>>>>> exhausted).
>>>>>>
>>>>>> These are some of the files I found inside:
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>>>>>>
>>>>>>
>>>>>> Any clues?
>>>>>>
>>>>>> Thanks,
>>>>>> David Riccitelli
>>>>>>
>>>>>>
>>>>>> ********************************************************************************
>>>>>> InsideOut10 s.r.l.
>>>>>> P.IVA: IT-11381771002
>>>>>> Fax: +39 0110708239
>>>>>> ---
>>>>>> LinkedIn: http://it.linkedin.com/in/riccitelli
>>>>>> Twitter: ziodave
>>>>>> ---
>>>>>> Layar Partner Network <http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>>>>> ********************************************************************************
>>>>
>


-- 
M.Sc. Alessandro Adamou

Alma Mater Studiorum - Università di Bologna
Department of Computer Science
Mura Anteo Zamboni 7, 40127 Bologna - Italy

Semantic Technology Laboratory (STLab)
Institute for Cognitive Science and Technology (ISTC)
National Research Council (CNR)
Via Nomentana 56, 00161 Rome - Italy


"I will give you everything, so long as you do not demand anything."
(Ettore Petrolini, 1930)

Not sent from my iSnobTechDevice

