incubator-clerezza-dev mailing list archives

From Reto Bachmann-Gmür <r...@apache.org>
Subject Re: Toy-Usecase challenge for comparing RDF APIs to wrap data (was Re: Future of Clerezza and Stanbol)
Date Thu, 15 Nov 2012 16:20:21 GMT
On Wed, Nov 14, 2012 at 8:32 PM, Sebastian Schaffert <
sebastian.schaffert@salzburgresearch.at> wrote:

>
> Am 13.11.2012 um 14:50 schrieb Reto Bachmann-Gmür:
>
> > On Tue, Nov 13, 2012 at 1:31 PM, Sebastian Schaffert <
> > sebastian.schaffert@salzburgresearch.at> wrote:
> > [...]
> >
> >>
> >> Despite the solution I described, I still do not think the scenario is
> >> well suited for evaluating RDF APIs. You also do not use Hibernate to
> >> evaluate whether an RDBMS is good or not.
> >>
> > The usecase I propose is just one example, and I don't think it should be
> > the only one; I just think that API comparison should be based on
> > evaluating their suitability for different, concretely defined usecases.
> > It has nothing to do with Hibernate nor with annotation-based
> > object-to-RDF property mapping (of which there have been several
> > proposals). It's the same principle as any23 or Aperture, but on the Java
> > object level rather than on the binary data level.
>
> The Java domain object level is one level of abstraction above the data
> representation/storage level. I was mentioning Hibernate as an example of a
> generic mapping between the Java object level and the data representation
> level (even though in this case it is a relational database, the same can
> be done for RDF). The Java object level does not really allow one to draw
> good conclusions about the data representation level.
>

We are talking about an API for modelling the entities introduced by RDF
and related specs. How implementations store the data, whether in RAM, in
quantum storage, or engraved in stone, is completely irrelevant for this
discussion.
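To make that level of abstraction concrete, here is a minimal sketch of the
kind of storage-agnostic interfaces I mean; the names are purely
illustrative and not taken from any particular API:

// Purely illustrative interfaces: they model the entities introduced by the
// RDF spec and say nothing about where or how the triples are stored.
interface RdfTerm {}                                 // any RDF node

interface BlankNodeOrIri extends RdfTerm {}          // what may appear as a subject

interface Iri extends BlankNodeOrIri {               // a node identified by an IRI
    String getUnicodeString();
}

interface Literal extends RdfTerm {                  // a literal node
    String getLexicalForm();
    Iri getDataType();
}

interface Triple {                                   // one RDF statement
    BlankNodeOrIri getSubject();
    Iri getPredicate();
    RdfTerm getObject();
}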


>
> > I have my infrastructure that deals with graphs, and I have a Set of
> > contacts: what does the missing bit look like that lets me process this
> > set with my RDF infrastructure? It's a reality that people don't (yet)
> > have all their data as graphs; they might have some contacts in LDAP and
> > some mails on an IMAP server.
>
>
> I showed you an example of annotation based object to RDF mapping to fill
> exactly that missing bit. This implementation works on any RDF API (we had
> it in Sesame, in KiWi, and now in the LMF) and has been done by several
> other people as well. It does not really help much in deciding how the RDF
> API itself should look, though.
>
Exactly. That's why that discussion is by no means required in order to
show how the Toy-Usecase can be implemented with Jena, Sesame, Clerezza,
Banana or any other API.


>
> >
> >
> >>>>
> >>>> If this is really an issue, I would suggest coming up with a bigger
> >>>> collection of RDF API usage scenarios that are also relevant in
> >>>> practice (as proven by a software project using them), including
> >>>> scenarios for how to deal with bigger amounts of data (i.e. beyond
> >>>> toy examples). My scenarios typically include >= 100 million
> >>>> triples. ;-)
> >>>>
> >>>> In addition to what Andy said about wrapper APIs, I would also like
> >>>> to emphasise the incurred memory and computation overhead of wrapper
> >>>> APIs. Not an issue if you have only a handful of triples, but a big
> >>>> issue when you have 100 million.
> >>
> > A wrapper doesn't mean you have in-memory objects for all the triples of
> > your store; that's absurd. But if your code deals with some resources at
> > runtime, these resources are represented by object instances which
> > contain at least a pointer to the resource in RAM. So the overhead of a
> > wrapper is linear in the amount of RAM the application would need anyway
> > and independent of the size of the triple store.
>
> So in other words: instead of a server with 8GB I might need one with 10GB
> of RAM, just because I decided to use a wrapper instead of the native API.
> Or to put it differently: with the same server I can hold fewer objects in
> my in-memory cache, possibly sacrificing a lot of processing time. From my
> experience, it makes a big difference.
>
Well, then you probably shouldn't be using any higher-level language or
abstraction. And 25%, which by Moore's law would be almost 8 months of
waiting, is in my view a huge exaggeration of the overhead.
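Just to make the overhead argument concrete: a wrapper object typically
carries little more than a reference to the wrapped object, so the cost
grows with the number of objects the application actually materialises, not
with the size of the store. A sketch with entirely illustrative names:

// Entirely illustrative: NativeTriple stands in for whatever type the
// wrapped API uses. Per triple object the application actually touches,
// the wrapper adds roughly one object header plus one reference; the cost
// does not depend on how many triples the store holds.
final class WrappedTriple {

    interface NativeTriple {}             // placeholder for the underlying API's triple type

    private final NativeTriple delegate;  // the wrapper's only extra state

    WrappedTriple(NativeTriple delegate) {
        this.delegate = delegate;
    }

    NativeTriple unwrap() {
        return delegate;
    }
}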

But again, I'm not arguing in favour of wrappers; I want to discuss what
the best API should look like. Whether this API is then adopted by
implementors or, if not, whether you use a wrapper, wait a couple of months
until the RAM required for the overhead is available at the same price,
invest a bit more now, or decide not to use the best API in order to save
RAM, is out of scope.



>
> > Besides, I would like to compare possible APIs here; ideally the best
> > API would be largely adopted, making wrappers superfluous. (I could also
> > mention that the Jena Model class also wraps a Graph instance.)
>
> Agreed.
>
> >
> >
> >>
> >>> It's a common misconception to think that Java sets are limited to
> >>> 2^31-1 elements, but even that would be more than 100 million. In the
> >>> challenge I didn't ask for time complexity; it would be fair to ask
> >>> for that too if you want to analyze scenarios with such a big number
> >>> of triples.
> >>
> >> It is a common misconception that just because you have a 64bit
> >> architecture you also have 2^64 bits of memory available. And it is a
> >> common misconception that in-memory data representation means you do not
> >> need to take into account storage structures like indexes. Even if you
> >> represent this amount of data in memory, you will run into the same
> >> problem.
> >>
> >> 95% of all RDF scenarios will require persistent storage. Selecting a
> >> scenario that does not take this into account is useless.
> >>
> >
> > I don't know where your RAM fixation comes from.
>
> I started programming with 64kbyte and grew up into Computer Science when
> "640kbyte ought to be enough for anyone" ;-)
>
> Joke aside, it comes from the real world use cases we are working on, e.g.
> a Linked Data and Semantic Search server at http://search.salzburg.com,
> representing about 1.2 million news articles as RDF, resulting in about 140
> million triples. It also comes from my experience with IkeWiki, which was a
> Semantic Wiki system completely built on RDF (using Jena at that time).
>
> The server the partner has provided us with for the Semantic Search has
> 3GB of RAM and is a virtual VMWare instance with not the best I/O
> performance. Importing all news articles on this server and processing them
> takes 2 weeks (after spending many days doing performance profiling with
> YourKit and identifying bottlenecks and unnecessary overheads like wrappers
> or proxy classes). If I have a wrapper implementation in between, even a
> lightweight one, it maybe takes just 10% more, i.e. 1.5 extra days! The
> performance overhead clearly matters.
>
> In virtually all my RDF projects of the last 10-12 years, the CENTRAL
> issues were always efficient/effective/reliable/convenient storage and
> efficient/effective/reliable/convenient querying (in parallel
> environments). These are the criteria an RDF API should IMHO be evaluated
> against.

If an API is designed in a way that its implementations are necessarily
less performant than implementations of other APIs that can be used to
solve the same usecase, then that's a strong argument against that API.



> In my personal experience, the data model and repository API of Sesame was
> the best choice to cover these scenarios in all different kinds of use
> cases I had so far (small data and big data). It was also the most flexible
> option, because of its consistent use of interfaces and modular choice of
> backends. Jena comes close, but did not yet go through the architectural
> changes (i.e. interface based data model) that Sesame already did with the
> 2.x series. Clerezza so far is not a real option to achieve my goals. It is
> good and convenient when working with small in-memory representations of
> graphs, but (as we discussed before) lacks for me important persistence and
> querying features. If I am purely interested in Sets of triples, guess
> what: I create a Java Set and put triples in it. For example, we even have
> an extended set with a (limited) query index support [1], which I created
> out of realizing that we spent a considerable time just iterating
> unnecessarily over sets. No need for a new API.
>
java.util.Set by itself is a poor API for triples. Besides being
incomplete, as it doesn't define what triples and resources look like, it
doesn't provide a way to filter triples with a triple pattern. Furthermore,
the identity of graphs is defined differently than that of sets. The
Clerezza API extends the Collection API (a Graph is not a Set) so that the
API can be used for 120 as well as for 120 billion triples.
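For illustration, a sketch of the difference using the Clerezza API (I am
writing the signatures from memory, so treat them as approximate; the URIs
are made up):

import java.util.Iterator;

import org.apache.clerezza.rdf.core.MGraph;
import org.apache.clerezza.rdf.core.Triple;
import org.apache.clerezza.rdf.core.UriRef;
import org.apache.clerezza.rdf.core.impl.SimpleMGraph;
import org.apache.clerezza.rdf.core.impl.TripleImpl;

public class TriplePatternExample {
    public static void main(String[] args) {
        MGraph graph = new SimpleMGraph();  // an MGraph is a mutable Collection<Triple>
        UriRef alice = new UriRef("http://example.org/alice");
        UriRef knows = new UriRef("http://xmlns.com/foaf/0.1/knows");
        UriRef bob = new UriRef("http://example.org/bob");
        graph.add(new TripleImpl(alice, knows, bob));

        // Triple-pattern filter: null acts as a wildcard. A store-backed
        // implementation can answer this from an index instead of iterating
        // over the whole collection, which a plain java.util.Set cannot express.
        Iterator<Triple> whoAliceKnows = graph.filter(alice, knows, null);
        while (whoAliceKnows.hasNext()) {
            System.out.println(whoAliceKnows.next().getObject());
        }
    }
}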



>
> [1]
> http://code.google.com/p/lmf/source/browse/lmf-core/src/main/java/kiwi/core/model/table/TripleTable.java
>
> > My usecase doesn't mandate in-memory storage in any way. The 2^31-1
> > misconception comes not from the 32-bit architecture but from the fact
> > that Set.size() is defined to return an int value (i.e. a maximum of
> > 2^31-1), but the API is clear that a Set can be bigger than that.
>
> I did not come up with any 2^31 misconception. And *of course* the 2^31-1
> topic is originally caused by 32 bit architectures, because this is why
> integer (in Java) is defined as 32bit (the size you can store in a
> processor register so simple computations only require a single instruction
> of the processor). And the fact that Java is using 32bit ints for many
> things DOES cause problems, as Rupert can tell you from experience: it
> might e.g. happen that two completely different objects share the same hash
> code, because the hash code is an integer while the memory address is a
> long.
>
> What I was referring to is that regardless of the amount of memory you have,
> persistence and querying is the core functionality of any RDF API. The use
> cases where you are working with RDF data and don't need persistence are
> rare (serializing and deserializing domain objects via RDF comes to my
> mind) and for consistency reasons I prefer treating them in the same way as
> the persistent cases,

I agree so far. But what does this have to do with the usecase? The usecase
never says that the data should be in memory.


> even if it means that I have to deal with persistence concepts (e.g.
> repository connections or transactions) without direct need. On the other
> hand, persistence comes with some important requirements, which are known
> for long and summarized in the ACID principles, and which need to be
> satisfied by an RDF API.
>
No, full ACID support is a requirement in some situations but definitely
not in all situations where you have large amounts of data. It's a typical
enterprise requirement, in which case you probably also want your
transactions to span different systems and not be confined to the RDF
repository, and are happy to use technologies like JTA.


>
> > And again, other usecases are welcome; let's look at how they can be
> > implemented with different APIs, how elegant the solutions are, what
> > their runtime properties are and, of course, how relevant the usecases
> > are, in order to find the most suitable API.
>
>
> Ok, my challenges (from a real project):
> - I want to be able to run a crawler over skiing forums, extract the
> topics, posts, and user information from them, perform a POS tagging and
> sentiment analysis and store the results together with the post content in
> my RDF repository;
>
Ok. What exactly would you like to see? You get some graph or graphs from
the crawler, have these graphs enriched by the POS tagger and analyzer, and
do myRepo.addAll(enrichedGraph) at the end. Maybe you could strip the
usecase down to the relevant parts and show me the solution in your
favourite API; then I translate it to Clerezza and we can see what is
missing.
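If it helps, here is a minimal sketch of how I picture that flow with the
Clerezza API (Crawler and Enricher are hypothetical application components,
and the Clerezza signatures are written from memory):

import org.apache.clerezza.rdf.core.MGraph;
import org.apache.clerezza.rdf.core.TripleCollection;
import org.apache.clerezza.rdf.core.impl.SimpleMGraph;

public class CrawlAndEnrich {

    // Hypothetical application components, not part of any RDF API.
    interface Crawler { TripleCollection nextPostAsGraph(); }
    interface Enricher { void addPosAndSentimentTriples(MGraph postGraph); }

    static void processOnePost(Crawler crawler, Enricher enricher, MGraph repositoryGraph) {
        MGraph enriched = new SimpleMGraph();
        enriched.addAll(crawler.nextPostAsGraph());    // graphs are Collections of Triples
        enricher.addPosAndSentimentTriples(enriched);  // enrichment just adds more triples
        repositoryGraph.addAll(enriched);              // the whole "store" step is one addAll
    }
}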


> - in case one of the processes inbetween fails (e.g. due to a network
> error), I want to properly roll back all changes made to the repository
> while processing this particular post or topic
>

Ok, probably the crawler should roll back as well. So this sounds like a
usecase for JTA, which is orthogonal to the RDF API.

> - I want to expose this dataset (with 10 million posts and 1 billion
> triples) as Linked Data, possibly taking into account a big number of
> parallel requests on that data (e.g. while Linked Data researchers are
> preparing their articles for ISWC)
>
> - I want to run complex aggregate queries over big datasets (while the
> crawling process is still running!), e.g. "give me all forum posts out of a
> set of 10 million on skiing that are concerned with 'carving skiing' with
> an average sentiment of >0.5 for mentionings of the noun phrase 'Atomic
> Racer SL6' and display for each the number of replies in the forum topic"
>
And you don't just want to pass a SPARQL query, but would like to have
defined combined indexes via the API beforehand; is that the challenge?
(Clerezza has this as an extension on top, but wouldn't it be better to
focus on the core API first?)



> - I want to store a SKOS thesaurus on skiing in a separate named graph and
> run queries over the combination of the big data set of posts and the small
> thesaurus (e.g. to get the labels of concepts instead of the URI)
>
Isn't this just SPARQL?
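For example, something along these lines; the graph names and vocabulary
choices are made up for the sketch, and the query string could be handed to
whatever SPARQL facility the store provides:

public class ThesaurusQueryExample {

    // Plain SPARQL combining the big posts graph with the small thesaurus graph.
    static final String QUERY =
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n" +
        "PREFIX sioc: <http://rdfs.org/sioc/ns#>\n" +
        "SELECT ?post ?conceptLabel WHERE {\n" +
        "  GRAPH <http://example.org/graphs/posts> {\n" +
        "    ?post a sioc:Post ;\n" +
        "          sioc:topic ?concept .\n" +
        "  }\n" +
        "  GRAPH <http://example.org/graphs/ski-thesaurus> {\n" +
        "    ?concept skos:prefLabel ?conceptLabel .\n" +
        "  }\n" +
        "}";
}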


> - I want to have a configurable rule-based reasoner where I can add simple
> rules like a "broaderTransitive" rule for the SKOS broader relationship; it
> has to run on 1 billion triples
>
Ok, a useful feature that goes beyond modelling the RDF specs in Java. In
the interest of modularity of the API, I would suggest focusing first on
usecases at the level of the spec family around RDF. Or do any of the APIs
you mentioned (Jena, Sesame or Clerezza) support such a feature?


> - I want to repeat the crawling process every X days, possibly updating
> post data in case something has changed, even while another crawling
> process is running and another user is running a complex query
>
Again, I don't see the API requirement here. Could you maybe describe it
from the client perspective: "the client has to be able to tell when a data
update transaction involving multiple operations starts and when it ends;
before it ends, other clients shall see the data without any
modification..."? If that's the requirement, we would be back to the
transaction support requirements, and thus back to what could probably be
solved with JTA.


>
> With the same API model (i.e. without learning a different API), I also
> want to:
> - with a few lines import a small RDF document into memory to run some
> small tests
> - take a bunch of triples and serialize them as RDF/XML or N3
>

Sure.

It would be handy if you could boil down the hard part you want to address,
so that the example code in your favourite API fits on a page and we can
compare it with other design alternatives.

Cheers,
Reto
