any23-dev mailing list archives

From Andy Seaborne
Subject Re: Advertising Any23 to Jena
Date Sun, 18 Nov 2012 18:19:33 GMT
> OK, thank you for making this explicit. I suppose my curiosity here
> revolved around where we (as an Any23 community) could/want to get
> involved in making Any23 a better framework and potentially a
> dependency within the semantic web projects within the ASF.
>   however I can't help but see/think that there are areas where we
> (Any23 and Jena) can find commonality.

It would be good.  Add Stanbol and the-project-née-Linda.

>> together with a new I/O architecture:
> accepted 100%
>> which is now ready for migrating into the codebase (after a pause due to RDF-WG
>> work and non-Apache time).

Now done ...

> accepted 110%
>> In particular, the parser pipeline has been heavily tuned to get load
>> performance for TDB.  (Long story to do with how Java I/O has hidden costs.)
> Jena framework specific?

Yes and no.

"Yes" -- the parsers use Jena classes, but very few.

"No" -- those classes are used only as carriers for triples and terms.
Output is to a Sink<Triple>, so it can go directly to a graph, a print
stream, storage (TDB), a stream-filter, whatever.
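
To make the Sink<Triple> idea concrete, here is a minimal, self-contained
sketch. It does not use Jena's classes: Triple, Sink, CollectingSink,
PrintingSink, FilterSink and ToyParser are all hypothetical stand-ins
written for illustration, and the "parser" just splits whitespace-separated
lines, so it is not a compliant N-Triples parser. The point is only that
the producer sees one interface, whatever the destination is.

```java
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical carrier object; stands in for a real triple class.
final class Triple {
    final String s, p, o;
    Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    @Override public String toString() { return s + " " + p + " " + o + " ."; }
}

// Hypothetical sink interface: the only thing the producer depends on.
interface Sink<T> {
    void send(T item);
    void flush();
    void close();
}

// Destination 1: keep everything in memory (a stand-in for "a graph").
final class CollectingSink implements Sink<Triple> {
    final List<Triple> triples = new ArrayList<>();
    @Override public void send(Triple t) { triples.add(t); }
    @Override public void flush() { }
    @Override public void close() { }
}

// Destination 2: write each triple to a print stream.
final class PrintingSink implements Sink<Triple> {
    private final PrintStream out;
    PrintingSink(PrintStream out) { this.out = out; }
    @Override public void send(Triple t) { out.println(t); }
    @Override public void flush() { out.flush(); }
    @Override public void close() { out.flush(); } // caller owns the stream
}

// Destination 3: a stream-filter that forwards only matching triples.
final class FilterSink implements Sink<Triple> {
    private final Predicate<Triple> keep;
    private final Sink<Triple> next;
    FilterSink(Predicate<Triple> keep, Sink<Triple> next) { this.keep = keep; this.next = next; }
    @Override public void send(Triple t) { if (keep.test(t)) next.send(t); }
    @Override public void flush() { next.flush(); }
    @Override public void close() { next.close(); }
}

// Toy producer: splits "s p o ." lines on whitespace. Illustration only.
final class ToyParser {
    static void parse(List<String> lines, Sink<Triple> dest) {
        for (String line : lines) {
            String text = line.trim();
            if (text.isEmpty()) continue;
            if (text.endsWith(".")) text = text.substring(0, text.length() - 1).trim();
            String[] parts = text.split("\\s+", 3);
            if (parts.length != 3) throw new IllegalArgumentException("Bad line: " + line);
            dest.send(new Triple(parts[0], parts[1], parts[2]));
        }
        dest.flush();
    }
}

final class SinkSketch {
    public static void main(String[] args) {
        List<String> data = Arrays.asList(
            "<http://example/a> <http://example/p> <http://example/b> .",
            "<http://example/a> <http://example/q> \"some text\" .");
        // Same producer, two different destinations.
        ToyParser.parse(data, new PrintingSink(System.out));
        CollectingSink onlyP = new CollectingSink();
        ToyParser.parse(data, new FilterSink(t -> t.p.equals("<http://example/p>"), onlyP));
        System.out.println("kept " + onlyP.triples.size() + " triple(s)");
    }
}
```

Because the filter is itself a Sink, it can be chained in front of any
other destination without the producer knowing about it.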

The carrier objects are from Jena's SPI - AKA the graph API, which is 
just Graph/Triple/Node/DatasetGraph/Quad (+datatypes).

ARP (the RDF/XML parser) does have its own abstraction of nodes to 
isolate it from the rest of Jena.  Once upon a time it did run 
separately (it still can but it's packaged with Jena now).  All the RIOT 
parsers are doing is using a zero-copy approach to the same thing. 
Churning objects during N-Triples parsing is a measurable cost.  The 
RIOT N-Triples parser does about 200K+ triples/s in ideal conditions [2].

The Jena API is built on the SPI - the API is much bigger than the SPI, 
which is really quite small and could be smaller.



[2] ideal: server or workstation class PC not doing anything else at the 
time.  No other disk activity, no CPU activity.  Materialise triples but 
send to a Sink that throws everything away.
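
The "materialise but throw away" setup can be sketched as below. This is
an illustration under stated assumptions, not the benchmark behind the
figure above: NullSinkBenchmark, CountingNullSink and measure are
hypothetical names, the triples are synthetic String[3] arrays rather than
parsed input, and no file I/O is involved, so any rate it prints says
nothing about RIOT itself.

```java
import java.util.function.Consumer;

final class NullSinkBenchmark {

    // Discards every item but counts it, so the producer still has to
    // build each object before handing it over.
    static final class CountingNullSink<T> implements Consumer<T> {
        long count = 0;
        @Override public void accept(T item) { count++; }
    }

    // Materialises n synthetic triples, sends each to the sink, and
    // returns the observed rate in triples per second.
    static double measure(long n, Consumer<String[]> sink) {
        long start = System.nanoTime();
        for (long i = 0; i < n; i++) {
            String[] triple = {
                "<http://example/s" + i + ">",
                "<http://example/p>",
                "\"" + i + "\""
            };
            sink.accept(triple);
        }
        long elapsedNanos = Math.max(1, System.nanoTime() - start);
        return n / (elapsedNanos / 1e9);
    }

    public static void main(String[] args) {
        CountingNullSink<String[]> sink = new CountingNullSink<>();
        double rate = measure(1_000_000, sink);
        System.out.printf("%d triples, %.0f triples/s%n", sink.count, rate);
    }
}
```

A single run like this includes JIT warm-up, so repeated runs would be
needed before reading anything into the number.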

gzip vs raw expanded file makes a small difference - raw is faster, but 
then very large NT files are often written all in one go, so they are 
laid out well on disk for the disk interface to stream.  SSDs are not 
that much faster if I/O is not random (I see < x2 faster for > x10 the 
cost mentioned; presumably the x10 is dropping).

>> PS the Turtle parser is compliant with the latest RDF 1.1 spec and the draft
>> RDF 1.1 Turtle test suite.
> Do we have these implementations over @Any23?
> So I suppose the underlying question/conversation/discussion I was
> putting forward concerns where, how and if both projects can benefit?
> We both (communities) have tried to have this before... however now as
> the Scottish National Football team are non-existent, I really have
> nothing to do...
> I know this is not a trivial issue... however I hope we are moving in
> the right direction.


The negative side

>   Lewis
