incubator-clerezza-dev mailing list archives

From Rupert Westenthaler <rupert.westentha...@gmail.com>
Subject Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files
Date Mon, 19 Mar 2012 10:54:41 GMT
Hi,

On 19.03.2012, at 09:16, Hasan Hasan wrote:

> Hi all,
> 
> I generally agree with extending Clerezza to support multiple
> requirements. Thus, I see the need for SingleDatasetTdbTcProvider,
> although I am a bit unhappy that application developers have to be
> aware of this.

Good documentation should help with that.

> Note that new Clerezza instances (at least my own build) no longer
> generate 200 MB of index files for empty graphs, but merely 200K.
> 

I tested this against both

* The Stanbol 0.9.0-incubating RC 3 and
* The unit tests of "rdf.jena.tdb.storage" trunk

In both cases the TDB directories were > 200 MByte.

After spending some time with different Google queries I was able to find

    http://tech.groups.yahoo.com/group/jena-dev/message/46144

which nicely describes the observed behavior.

But even if we define this as a bug in how Mac OS handles sparse files, there is still the problem
of the exploding number of open files, which will kill the JVM (and possibly even the host
system).
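
To check how a given file system treats such files, here is a minimal sketch (Python, POSIX only). It assumes that the size difference comes from preallocated index files being stored sparsely on some systems and fully allocated on others. The helper name and the 200 MByte figure are only for illustration, and nothing here calls Jena TDB.

```python
import os
import tempfile


def sparse_file_sizes(apparent_size):
    """Create a temporary file of the given size without writing data.

    Returns (apparent size in bytes, allocated size in bytes).
    """
    fd, path = tempfile.mkstemp()
    try:
        # Extend the file to the requested length; no data is written.
        os.ftruncate(fd, apparent_size)
        st = os.fstat(fd)
        # Assumption: st_blocks is counted in 512-byte units, as on
        # Linux and Mac OS. The attribute is not available on Windows.
        return st.st_size, st.st_blocks * 512
    finally:
        os.close(fd)
        os.remove(path)


apparent, allocated = sparse_file_sizes(200 * 1024 * 1024)
print("apparent:", apparent, "bytes, allocated:", allocated, "bytes")
```

If the allocated size is close to the apparent size, the file system does not store the file sparsely, and every such file costs its full size on disk.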


On 19.03.2012, at 10:45, Daniel Spicar wrote:

> Hi Rupert,
> 
> I ran into a similar problem when I worked on a Jena SDB storage provider
> (not have to create separate databases for each Clerezza graph). Back then
> I didn't create a proper solution so I am interested in your approach. From
> what you described it sounds good to me.
> 

I created https://issues.apache.org/jira/browse/CLEREZZA-691 for SingleDatasetTdbTcProvider

I implemented a SingleDatasetTdbTcProvider over the weekend. It already passes the MGraph
related tests, but still fails the TcProvider tests, as I need to add support for using the
same graph name for both an MGraph and a Graph (as required by the TcProviderTest).
Is this really necessary, or only something that is accidentally used by the TcProviderTest?
It is something that cannot be "natively" supported when using a single Dataset, as named graph
names MUST be unique.
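
To make the naming question concrete, here is a toy sketch (plain Python, not the Clerezza or Jena API; the class and method names are made up). It assumes that a single dataset gives all named graphs one shared namespace, so a second registration under the same name has to be rejected regardless of whether it is an MGraph or a Graph.

```python
class NameAlreadyUsed(Exception):
    """Raised when a graph name is already taken in the shared dataset."""


class SingleDatasetNames:
    """Toy model of one dataset: every named graph shares one namespace."""

    def __init__(self):
        self._kinds = {}  # graph name -> "MGraph" or "Graph"

    def create(self, name, kind):
        if kind not in ("MGraph", "Graph"):
            raise ValueError("unknown kind: " + kind)
        if name in self._kinds:
            # Same name, no matter which kind: named graphs must be unique.
            raise NameAlreadyUsed(name + " is already a " + self._kinds[name])
        self._kinds[name] = kind


names = SingleDatasetNames()
names.create("http://example.org/g1", "MGraph")
try:
    # Reusing the name for a Graph collides with the existing MGraph.
    names.create("http://example.org/g1", "Graph")
    collided = False
except NameAlreadyUsed:
    collided = True
print("collision detected:", collided)
```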

Currently I am developing this as part of "rdf.jena.tdb.storage", as I think there is no
need for a separate module for a 2nd variant of a TcProvider that is based on the same underlying
technology.

As soon as it passes the same set of tests as used for the "TdbTcProvider" I will share the
code. I would also like to test it within Apache Stanbol, but this could be hard as I would
need to change the Clerezza dependencies in the trunk from the Clerezza release to the current
SNAPSHOT versions.

Would you prefer a patch or should I commit directly to trunk? An issue branch does not seem
to be needed, as these additions will not affect current functionality. WDYT?


> There are a couple of things to keep in mind. I think they are both handled
> on a higher layer and should work transparently but it's good to keep it in
> mind.
> 1. Graph permissions need to work. I think they work via the graph
> URI/name, so they may be handled transparently.
> 2. Make sure rdf.storage.externalizer works with your solution.
> 

I have never used those things. I will have a look, but it would be wise if someone with more
knowledge could validate this after I have provided a first version.

best
Rupert

> Best,
> Daniel
> 
> On 19 March 2012 09:16, Hasan Hasan <hasan@trialox.org> wrote:
> 
>> Hi all,
>> 
>> I generally agree with extending Clerezza to support multiple
>> requirements. Thus, I see the need for SingleDatasetTdbTcProvider,
>> although I am a bit unhappy that application developers have to be
>> aware of this.
>> Note that new Clerezza instances (at least my own build) no longer
>> generate 200 MB of index files for empty graphs, but merely 200K.
>> 
>> Regards
>> Hasan
>> 
>> 
>> On Fri, Mar 16, 2012 at 2:10 PM, Rupert Westenthaler <
>> rupert.westenthaler@gmail.com> wrote:
>> 
>>> Hi David, stanbol & clerezza community
>>> 
>>> Short summary of the situation:
>>> 
>>> The Ontonet component generates a lot of MGraphs using the Jena TDB
>>> provider. This causes the disk consumption and number of open files to
>>> explode. See the quoted emails for details.
>>> 
>>> 
>>> @Stanbol: we are already discussing how to avoid the creation of so many
>>> graphs.
>>> 
>>> 
>>> @Clerezza: the observed behavior of the TDB provider is also very dangerous
>>> (at least for typical use cases in Apache Stanbol).
>>> 
>>> Even though it targets something different, CLEREZZA-467 [1] may provide a
>>> possible solution for that, as it suggests using named graphs instead of
>>> isolated TDB instances for creating MGraphs.
>>> 
>>> To be honest, this would be the optimal solution for our usage of Clerezza
>>> in Stanbol. However, I assume that for a semantic CMS it is safer to use
>>> different TDB datasets.
>>> 
>>> Because of that I  would like to make the following proposal that
>>> hopefully covers both the needs of Apache Stanbol and Apache Clerezza.
>>> 
>>> 1. AbstractTdbTcProvider: providing most of the functionality needed to
>>> store Clerezza MGraphs in Jena TDB
>>> 
>>> 2. TdbTcProvider: The same as now, but extending the abstract one. It
>>> follows the currently used methodology of mapping Clerezza graphs to separate
>>> TDB datasets
>>> 
>>> 3. SingleDatasetTdbTcProvider: TDB provider variant that stores all
>>> MGraphs in a single TDB dataset. This provider should also support
>>> "configurationFactory=true" (multiple instances). Each instance would use a
>>> different TDB dataset to store its MGraphs.
>>> 
>>> By default the SingleDatasetTdbTcProvider would be inactive, because it
>>> requires a configuration of the directory for the TDB dataset as well as a
>>> name (that can be used in Filters). This ensures full backward
>>> compatibility.
>>> 
>>> In environments such as Stanbol, where you want to store multiple graphs
>>> in the same TDB dataset, you would need to provide a configuration for the
>>> SingleDatasetTdbTcProvider. Here you have two possible usage scenarios:
>>> 
>>> * if you just need a single TDB dataset that stores all MGraphs, then you
>>> can assign a high enough service.ranking to the SingleDatasetTdbTcProvider
>>> and use the TcManager as usual to create your graphs.
>>> * if you want to use single TDB datasets or a mix of the TdbTcProvider and
>>> SingleDatasetTdbTcProviders, you will need to use corresponding filters.
>>> 
>>> 
>>> WDYT
>>> Rupert
>>> 
>>> 
>>> [1] https://issues.apache.org/jira/browse/CLEREZZA-467
>>> 
>>> On 16.03.2012, at 10:44, Rupert Westenthaler wrote:
>>> 
>>>> Hi David, all
>>>> 
>>>> this could be the explanation for the failed build on the Jenkins server
>>>> when the SEO configuration for the Refactor engine was used in the default
>>>> configuration of the Full launcher
>>>> 
>>>> see http://markmail.org/message/sprwklaobdjankig for details.
>>>> 
>>>> To me it looks as if the RefactorEngine creates multiple Jena TDB
>>>> instances for the various MGraphs it creates. One needs to know that even
>>>> for an empty graph Jena TDB creates ~200 MByte of index files. So it is
>>>> important to map multiple MGraphs to different named graphs of the same
>>>> Jena TDB store.
>>>> 
>>>> I have no idea how Clerezza manages this or how Ontonet creates MGraphs,
>>>> but I hope this can help in tracing this down.
>>>> 
>>>> best
>>>> Rupert
>>>> 
>>>> On 16.03.2012, at 10:30, David Riccitelli wrote:
>>>> 
>>>>> Dears,
>>>>> 
>>>>> As I ran into disk issues, I found that this folder:
>>>>> sling/felix/bundleXXX/data/tdb-data/mgraph
>>>>> 
>>>>> where XX is the bundle of:
>>>>> Clerezza - SCB Jena TDB Storage Provider
>>>>> org.apache.clerezza.rdf.jena.tdb.storage
>>>>> 
>>>>> took almost 70 gbytes of disk space (then the disk space has been
>>>>> exhausted).
>>>>> 
>>>>> These are some of the files I found inside:
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>>>>> 
>>>>> 
>>>>> Any clues?
>>>>> 
>>>>> Thanks,
>>>>> David Riccitelli
>>>>> 
>>>>> 
>>>>> ********************************************************************************
>>>>> InsideOut10 s.r.l.
>>>>> P.IVA: IT-11381771002
>>>>> Fax: +39 0110708239
>>>>> ---
>>>>> LinkedIn: http://it.linkedin.com/in/riccitelli
>>>>> Twitter: ziodave
>>>>> ---
>>>>> Layar Partner Network <http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>>>> ********************************************************************************

