incubator-jena-dev mailing list archives

From Paolo Castagna <>
Subject Re: Jena build (some thoughts)
Date Tue, 06 Sep 2011 15:09:10 GMT
Hi Andy
thanks for sharing some thoughts.

Apologies in advance for my long reply, full of questions.

Andy Seaborne wrote:
> This message is a collection of thoughts on reworking the build process.
>  It's not a complete proposal.
> == Current Status
> Subsystems (to avoid the "module" word)
>   Jena, IRI, ARQ, LARQ, TDB, SDB, Fuseki

Joseki is not on the list. I imagine because Fuseki replaces it. I also imagine there will
be no Apache release for Joseki ever.

> The current build system is a one maven project per subsystem. Each
> subsystem produces a single download zip file and also deploys artifacts.
> The builds are linked by version dependencies in the POM files.  There
> is a hack to get ARQ into the Jena download to break the circular
> dependency.

I have a simple test to decide how to break dependencies: can you use X without Y?

Currently, you can use Jena without ARQ.
However, you cannot use ARQ without Jena.
Therefore, ARQ depends on Jena.

Rightly so, Jena's pom.xml file does not have a dependency on ARQ. However, the Jena distribution
(i.e. the current .zip file) includes ARQ since most of the time people want to run SPARQL
queries as well.

> == Goals
> These are my take on desirable features, not necessarily absolute
> requirements, so if it isn't practical to achieve it in the overall
> system, then a goal can be modified or removed.

 + Creating an Apache Release

> + Balance cost of change and benefit (we don't have to start with a
> clean sheet - we can leave some things as they are, because they are).
> + A single download zip file  for using Jena as a library

I imagine this is not very different from the current Jena distribution as a
.zip file. Am I right?

I would expect to find a lib directory with all the dependent jars as well
as jena-x.y.z.jar, arq-x.z.y.jar, iri, etc.

Would that include TDB jar as well?

Would that include SDB jar as well?

Does this imply there is no need for a .zip distribution of ARQ, SDB or TDB?

A single download zip file is good: less confusion for people, less work
for us (i.e. we just manage a single zip file).

> + A single jar file for using Jena as a library

Could you be more precise on what this jar would include?

Does it include all the necessary runtime dependencies or is it just code we write?

Does it include ARQ, SDB and TDB?

I am not sure who this single jar file is targeted at.
Expert developers/users would probably not like a single jar if it includes all
the runtime dependencies as well.
Expert developers/users would probably not use that single jar either, since sometimes they want
to use or test a patched version of just one of the components (i.e. ARQ, IRI, SDB, TDB, etc.).
New users would probably download the .zip distribution to use Jena for their first time.

If we think of Jena as a library, we should focus on modularity and ease of use with tools such
as Ivy, Maven, etc., document this well, and provide simple examples to start with (as
we are trying to do already).
Often, I just want to parse a simple Turtle file and I would find it annoying to include a
~20MB jar file just to do that.

 "When we started working on Any23, the Sesame library was more
  modularized and documented and there was also a full Maven support.
  Today much of these reasons are no longer valid."
  (from mailing list)

I can relate to that sort of comment.

Now, Jena offers good|full(?) Maven support.
However, I would argue we are still lacking in terms of modularization.

If someone does not need an inference engine, or support for RDF/XML parsing, or OWL APIs,
it would be good if he/she could use Jena without those parts.
Even more so for people wanting to run Jena on "constrained" environments, just to make an
example: Android.

A single jar file seems to me to go against some of these things.
So, I'd like to understand better why you propose a "single jar file for using Jena as a library".

> The main download should be complete - everything you need to write a
> Jena application using Jena as a library.  I'd like to change to having
> a single zip that is current Jena + ARQ + LARQ + TDB (maybe?).  

I think this is a good idea.

I would also add SDB to the single zip file. Why not?

And, I would remove zip distribution files from ARQ, LARQ (which hasn't got one at the moment),
SDB and TDB.

> That puts datasets and quads into Jena core.


> (Some API changes could also happen to make this feel more integrated.)
> It also seems easier to deliver a single jar for this.

See above.

> And a single obviously-named Maven artifact - at the moment, a single
> dependency to pull is e.g. TDB because that pulls in the rest, which
> isn't exactly obvious.

I am not sure I follow you here, probably because I don't understand exactly what you mean
by "a single obviously-named Maven artifact".

If someone wants to use TDB, the obvious dependency to pull into their project
is TDB (which depends on ARQ, which depends on Jena), letting Maven, or any other tool which
can download artifacts from a repository, resolve the rest of the dependencies (with the
right version numbers). Currently we have:

[INFO] com.hp.hpl.jena:tdb:jar:0.8.11-SNAPSHOT
[INFO] +- com.hp.hpl.jena:arq:jar:2.8.9-SNAPSHOT:compile
[INFO] |  +- org.codehaus.woodstox:wstx-asl:jar:3.2.9:compile
[INFO] |  |  \- stax:stax-api:jar:1.0.1:compile
[INFO] |  +- org.apache.lucene:lucene-core:jar:2.3.1:compile
[INFO] |  +- org.apache.httpcomponents:httpclient:jar:4.1.2:compile
[INFO] |  |  \- commons-codec:commons-codec:jar:1.4:compile
[INFO] |  \- org.apache.httpcomponents:httpcore:jar:4.1.2:compile
[INFO] +- com.hp.hpl.jena:arq:jar:tests:2.8.9-SNAPSHOT:test
[INFO] +- com.hp.hpl.jena:jena:jar:2.6.4:compile
[INFO] |  +-
[INFO] |  \- xerces:xercesImpl:jar:2.7.1:compile
[INFO] +- com.hp.hpl.jena:jena:test-jar:tests:2.6.4:test
[INFO] +- com.hp.hpl.jena:iri:jar:0.8:compile
[INFO] +- junit:junit:jar:4.8.2:test
[INFO] +- org.slf4j:slf4j-api:jar:1.6.1:compile
[INFO] +- org.slf4j:slf4j-log4j12:jar:1.6.1:compile
[INFO] \- log4j:log4j:jar:1.2.16:compile
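
To make the "obvious dependency" concrete: the tree above is the one for the TDB project itself,
but an application should get the compile-scope part of it resolved transitively from a single
declaration along these lines. The coordinates are taken from the first line of the output; the
snippet is only an illustration, not copied from an existing pom.xml:

```xml
<!-- Illustrative only: one declared dependency, the rest resolved transitively -->
<dependencies>
  <dependency>
    <groupId>com.hp.hpl.jena</groupId>
    <artifactId>tdb</artifactId>
    <version>0.8.11-SNAPSHOT</version>
  </dependency>
</dependencies>
```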

Is something like this what you are proposing? :

[INFO] com.hp.hpl.jena:jena-all:jar:x.y.z
[INFO] +- org.codehaus.woodstox:wstx-asl:jar:3.2.9:compile
[INFO] |  \- stax:stax-api:jar:1.0.1:compile
[INFO] +- org.apache.lucene:lucene-core:jar:2.3.1:compile
[INFO] +- org.apache.httpcomponents:httpclient:jar:4.1.2:compile
[INFO] |  \- commons-codec:commons-codec:jar:1.4:compile
[INFO] |     \- org.apache.httpcomponents:httpcore:jar:4.1.2:compile
[INFO] +-
[INFO] +- xerces:xercesImpl:jar:2.7.1:compile
[INFO] +- junit:junit:jar:4.8.2:test
[INFO] +- org.slf4j:slf4j-api:jar:1.6.1:compile
[INFO] +- org.slf4j:slf4j-log4j12:jar:1.6.1:compile
[INFO] \- log4j:log4j:jar:1.2.16:compile

A jena-all-x.y.z.jar uber jar?

> Other goals or considerations?

As I've said above: creating Apache releases should be IMHO the number one priority.

So should "release early, release often", at least at the beginning while we get used to the
Apache processes and how to create Apache releases. This is also one of the objectives of
the incubation phase and, more importantly, a requirement for graduation.

How can we demonstrate our ability to create Apache releases if we release every 6 or 12 months?
Or, how long do we expect to be in "incubation"? ;-)

Initially, while we get used to the Apache way of cutting releases, we should release more often
and not be afraid of quickly fixing problems or applying improvements to the release process
using minor version numbers _._.x. Nowadays, if a new jar is a drop-in replacement, people are
willing to (and can easily) upgrade.

> == Possible build layout.
> Divide the overall project into a number of maven modules for building
> parts of the system and a number of projects for making deliverables.
> Just the code modules would mean you can work from a set of many jars
> and mix and match for development etc. 

For others, we went down this route already... without using Maven (i.e.
using Ant + Ivy, and it has been painful). The result is here:

It was a good experience to see what's necessary to separate a "core" from RDF/XML parsing,
etc. and to think about what a minimal system would include.

> Application writers can get a single consolidated jar.

Here, again, I don't understand whom you mean by "application writers" (is it us? is it other
companies using Jena? University students? other Apache committers?) and why they would
benefit from a "single consolidated jar".

> Jena-top-POM -- common declarations, a lot of properties getting set.


I think having a Jena-specific parent pom.xml file is a good idea (and a best practice).
We usually do this for our internal projects @ Talis.
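
As a rough sketch of what such a parent pom.xml could look like: the groupId, artifactId, module
directory names and the ver.slf4j property below are all hypothetical placeholders (the module
names only loosely follow the list of code modules quoted further down), and x.y.z stands for
whatever version is chosen. Only the slf4j version number comes from the dependency tree above.

```xml
<!-- Hypothetical Jena parent POM; groupId, artifactId and module names are placeholders -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>

  <groupId>org.example.jena</groupId>
  <artifactId>jena-top</artifactId>
  <version>x.y.z</version>
  <packaging>pom</packaging>

  <!-- Common declarations shared by every module -->
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <ver.slf4j>1.6.1</ver.slf4j>
  </properties>

  <!-- One directory per code module -->
  <modules>
    <module>iri</module>
    <module>atlas</module>
    <module>core</module>
    <module>riot</module>
    <module>arq</module>
    <module>tdb</module>
    <module>sdb</module>
  </modules>

  <!-- Versions are declared once here; modules then omit <version> -->
  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>${ver.slf4j}</version>
      </dependency>
    </dependencies>
  </dependencyManagement>
</project>
```

Each module would then name this POM in its own parent element and inherit the properties and
managed versions from it.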

Large organizations have a corporate parent pom as well (which for us should be this:

> Code modules:
>   JenaSys
>     -- This is the current Jena2.
>       How much do we want to split it up?
>       Is it worth the effort?
>         core = graph + datatypes

+1 (== I think it is a good idea and I am prepared to help here) on a small/minimal Jena core

I often need just this (+ RIOT below) and I imagine it would make a lot of people and projects
(such as Any23, just to make an example) happy.
Another use case: launching MapReduce jobs with ~20MB jar files is a bit of a pain, you often
need just core + RIOT (to parse N-Triples|N-Quads files) there.

>         RDF API (inc enhanced?)
>         owlapi
>         rules
>         Assembler? Here? Module of its own?
>   IRI
>   Atlas -- Non RDF specific stuff.

+1 (== ditto) on separating out Atlas from ARQ.

>   RIOT
>     -- ideally Jena-code+RIOT is a useful set


>     -- Move ARP and XML output here or separate module again.

As a separate module, it changes much less often than RIOT.
Another use case: apparently you cannot run with Xerces on Android. This caused problems for
people wanting to use Jena on Android.

>   ARQ
>    -- minus atlas, and RIOT


>   TDB -- transactional
>   SDB
>   RDB? Legacy or remove once and for all.

+1 on having RDB as a deprecated separate module.

>   Documentation = website only.


The disadvantage is that on the website you only have the most recent documentation, not the
one corresponding to the (maybe obsolete) version of Jena you might be using.
However, since Jena is quite stable now, I don't think this will be a problem (and we can
always revisit/change this in future).

> Deliver modules:
>   Jena  -- the deliverable: one jena-the-jar and zip file.
>   JenaCmd -- Command line things: Jena+ARQ+TDB commands

Maven artifacts, IMHO, should be included in the list of "deliverables" of a Jena release
(although only what we will be putting here has 'legal' value in Apache).

The rationale for also treating Maven artifacts as first-class deliverables of a Jena release is this:
Jena is 'mostly' a library which people use to write applications, and modern build tools/systems
(such as Maven, but not only that) have dependency engines to transitively resolve dependencies,
as well as on-line repositories where developers can easily find artifacts (including sources
and test packages). Once you have that, you rarely download a .zip or .tar.gz manually as a developer.

... and we all felt the pain of failing to find an artifact of a library we want to use.

> Fuseki is a separate module and deliverable.  It uses combined Jena as a
> dependency but does not need to be part of the library build.

I agree.

Fuseki is something an end-user wants to: download, unzip, (load data) and run.

> Eyeball is a separate module and deliverable.  It uses combined Jena as
> a dependency but does not need to be part of the library build.

I mostly agree.

I've never used Eyeball much; however, someone might want to include/use Eyeball in their application
(with additional/custom extensions/checks).
For this reason, I would argue it would not be a bad idea to have Eyeball artifacts (i.e. an eyeball-x.y.z.jar)
published as Maven artifacts on the Apache Maven Repository.

> === Questions and notes.
> 1/ We currently make some attempt to deliver the test suite in the zip
> so people can locally run it to check an installation.  From memory, the
> only thing this seems to catch is problems running the test suite, not
> problems with installation.  Maybe it's not worth the effort.

+1 on removing

Rationale: if you want to run the test suite you should be able to check out a tagged source
tree and type mvn test. For example:

  svn co arq
  cd arq
  mvn test

This is a much better way to let people run the test suite on their system (i.e. different
OS, different JVM, etc.)

I do agree that it's not exactly the same as running the test suite against arq-x.y.z.jar,
but how many other Apache projects do you know who are doing this? ;-)

However, it is sometimes useful to publish the test suite as a Maven artifact. This way people
can specify a dependency on it and reuse tests or utilities we have in our test suites elsewhere.
This is the reason why, for example, we have
(as well as: I consider
this a good practice and, if possible, I'd like to keep it.
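
For illustration, a project wanting to reuse those published tests could declare something like
the following. The coordinates, classifier and type are the ones shown in the dependency tree
earlier in this message; the snippet itself is a sketch, not copied from an existing pom.xml:

```xml
<!-- Illustrative only: depending on published test artifacts, test scope only -->
<dependencies>
  <!-- ARQ tests: jar type with the "tests" classifier -->
  <dependency>
    <groupId>com.hp.hpl.jena</groupId>
    <artifactId>arq</artifactId>
    <version>2.8.9-SNAPSHOT</version>
    <classifier>tests</classifier>
    <scope>test</scope>
  </dependency>
  <!-- Jena tests: published with type test-jar -->
  <dependency>
    <groupId>com.hp.hpl.jena</groupId>
    <artifactId>jena</artifactId>
    <version>2.6.4</version>
    <type>test-jar</type>
    <scope>test</scope>
  </dependency>
</dependencies>
```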

The ideal situation (and best practice) would be to have the files necessary to run the test
suite included in that jar (i.e. arq-x.y.z-tests.jar). Maven has support for that, but people
need to use getSystemResourceAsStream() to read test files (as I am sure you know). At development
time, those files must be in src/test/resources (for example, LARQ does this:
This would be my favorite option, but it requires some changes.
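
As a sketch of the Maven support mentioned above: the test-jar goal of maven-jar-plugin packages
the compiled test classes, together with whatever was copied from src/test/resources, into an
additional -tests.jar. The fragment below is a generic illustration and is not taken from any
Jena pom.xml:

```xml
<!-- Illustrative only: also build <artifactId>-<version>-tests.jar during packaging -->
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-jar-plugin</artifactId>
      <executions>
        <execution>
          <goals>
            <goal>test-jar</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

Because the data files then live inside the jar, tests have to load them from the classpath
rather than from file paths, which is the change referred to above.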

> 2/ The Apache top level POM has a list of versioned plugins in it which
> we'd inherit.  Hopefully it helps with an Apache release but it does seem
> quite a lot.  The default compilation is Java 1.4 -- we need to check
> details.

LARQ pom.xml file, for example, has this:


However, it specifies Java 1.6 for compiling:


You can verify the effective pom.xml file using: mvn help:effective-pom

So, technically, the fact that Apache parent pom.xml has Java 1.4 as default compilation isn't
an issue.
I've not found problems with it, so far. This does not mean there aren't any... but we should
be able to override any behavior we don't like, and we control if and when we upgrade from one
version to another.
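
As a sketch of the kind of override described above (this is not LARQ's actual pom.xml, whose
contents are not reproduced here): a project can inherit from the Apache parent POM and still
set its own compiler level. The 1.6 values follow the text; the rest is ordinary
maven-compiler-plugin configuration, and the fragment shows only the relevant parts of a pom.xml:

```xml
<!-- Illustrative only: inherit from the Apache parent POM ... -->
<parent>
  <groupId>org.apache</groupId>
  <artifactId>apache</artifactId>
  <version>9</version>
</parent>

<!-- ... and override the inherited default compiler level -->
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <configuration>
        <source>1.6</source>
        <target>1.6</target>
      </configuration>
    </plugin>
  </plugins>
</build>
```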

I think we will be better off having org.apache:apache:9 as the parent pom (directly or
via our own parent pom), as suggested here:

> 3/ For RDB, I propose creating a maven module and putting the code here
> with a dependency of whatever version of Jena it is at the time then
> leaving it frozen.  Alternatively, zip up the code and dump somewhere in
> case anyone wants to port it.

+1 on having RDB as separate module (depending on Jena).

> 4/ Shall we leave the documentation out of the build and just have it on
> the website?

What about javadocs?

> 5/ Jump to maven 3?

Not sure why you are asking this.

I am still using Maven v2.x.y on my desktop (without problems) but we are using Maven v3.0.3
with some of our modules on Jenkins, currently (fingers crossed) with no problems (and it
should be more stable in the future).

