incubator-clerezza-dev mailing list archives

From Minto van der Sluis <mi...@xup.nl>
Subject Might I be using Clerezza in the wrong way.
Date Fri, 22 Feb 2013 11:52:20 GMT
Hi folks,

I am starting to wonder if I use Clerezza and semantic technology in
general in the wrong way. To make it more clear I will first describe my
situation and then why I think I might be using it incorrectly.

Context
=======
Basically we are gathering and distributing annotations. For this we
make use of OpenAnnotation (OA, see [1]).  Since OA is based on RDF we
were looking for products capable of storing this data. We decided to
use Clerezza as an abstraction for the actual storage layer. This way
we can switch storage engines quite easily.

Now it turns out that our annotations should support annotations on
annotations. Among other things, this lets us tell whether a root
annotation has been properly processed or rejected (status change). This
leads us to the notion of annotation trees. Every one of these trees
starts with a single annotation as the root/trunk.

The system we work on not only stores annotations but also has to return
complete annotation trees. For this reason we decided to store every
tree in its own named graph. This way we can easily retrieve a full
tree by returning the complete named graph. The downside is that we
will end up with a massive number of (small) named graphs.
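
To make the layout concrete, here is a minimal, library-free sketch of
"one named graph per annotation tree". It is not Clerezza's API: the
class AnnotationTreeStore, the Triple type and the graph naming scheme
(root URI plus "/tree") are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of the layout; NOT Clerezza's API.
class AnnotationTreeStore {

    /** A triple as plain strings: subject, predicate, object. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    // One entry per annotation tree: graph name -> triples of that tree.
    private final Map<String, List<Triple>> graphs = new HashMap<>();

    /** Assumed naming scheme: the graph name is derived from the root annotation. */
    static String graphNameFor(String rootUri) {
        return rootUri + "/tree";
    }

    /** Adds a triple to the named graph of the tree rooted at rootUri. */
    void add(String rootUri, Triple triple) {
        graphs.computeIfAbsent(graphNameFor(rootUri), k -> new ArrayList<>())
              .add(triple);
    }

    /** Returns the complete tree, i.e. every triple in its named graph. */
    List<Triple> tree(String rootUri) {
        List<Triple> triples =
                graphs.getOrDefault(graphNameFor(rootUri), Collections.emptyList());
        return Collections.unmodifiableList(triples);
    }

    /** Number of named graphs; grows by one for every root annotation. */
    int graphCount() {
        return graphs.size();
    }
}
```

Retrieving a tree is a single lookup by graph name, which is the
benefit described above. The cost is also visible: graphCount() grows
with the number of root annotations, not with the number of triples.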

For the storage we decided (for the time being) to use
SingleTdbDatasetTcProvider. This is also why I started wondering
whether we are on the right track. Looking at the
SingleTdbDatasetTcProvider implementation, I have the following observations:

Observations
============
1) SingleTdbDatasetTcProvider keeps the names of graphs in 2 separate
sets. This does not seem very efficient for large numbers of graph
names (100k+ or possibly 1M+).

    private Set<UriRef> graphNames;
    private Set<UriRef> mGraphNames;


2) All graph names are logged on startup (activation). This is feasible
for a small number, but not for a rather large number of named graphs.

3) FileTcProvider (rdf.file.storage) also keeps names in memory.

    private Map<UriRef, FileMGraph> uriRef2MGraphMap =
            new HashMap<UriRef, FileMGraph>();

4) SesameNativeWeightedProvider () keeps not only the names in memory,
but the graph objects as well.

    private HashMap<UriRef, SesameMGraph> mGraphs;
    private HashMap<UriRef, SesameGraph> graphs;
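
As a rough way to get a feel for the scale in observation 1, the
sketch below builds a set of synthetic graph names and counts their
characters. The class name GraphNameFootprint and the URI pattern are
assumptions for illustration only; the character count is just a lower
bound on the name data and does not measure per-object JVM overhead
(String and HashSet entries), which comes on top of it.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical back-of-envelope helper; not part of Clerezza.
class GraphNameFootprint {

    /** Builds n synthetic graph names; the URI pattern is an assumption. */
    static Set<String> syntheticNames(int n) {
        Set<String> names = new HashSet<>(n * 2);
        for (int i = 0; i < n; i++) {
            names.add("http://example.org/annotation-tree/" + i);
        }
        return names;
    }

    /** Total characters across all names: a lower bound on the name data held in memory. */
    static long totalChars(Set<String> names) {
        long total = 0;
        for (String name : names) {
            total += name.length();
        }
        return total;
    }

    public static void main(String[] args) {
        // 100k names, the lower end of the range mentioned above.
        Set<String> names = syntheticNames(100_000);
        System.out.println(names.size() + " names, "
                + totalChars(names) + " characters held in memory");
    }
}
```

Raising the argument of syntheticNames towards 1M gives an idea of how
the in-memory name sets would grow with one named graph per tree.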

Are we approaching this incorrectly, or are we running into limitations
of the current implementation? In other words, is a large number of
named graphs supported, or are Clerezza and maybe even semantic
technology in general not designed for this?

Any thoughts?

Regards,

Minto

-- 
ir. ing. Minto van der Sluis
Software innovator / renovator
Xup BV

[1] http://www.openannotation.org/

