incubator-jena-dev mailing list archives

From Paolo Castagna
Subject Re: Jena build (some thoughts)
Date Tue, 06 Sep 2011 19:53:34 GMT

Andy Seaborne wrote:
> On 06/09/11 16:09, Paolo Castagna wrote:
>> Hi Andy
>> thanks for sharing some thoughts.
>> Apologies in advance for my long reply, full of questions.
> Not full answers ... it's some thoughts.
>> Andy Seaborne wrote:
>>> This message is a collection of thoughts on reworking the build process.
>>>   It's not a complete proposal.
>>> == Current Status
>>> Subsystems (to avoid the "module" word)
>>>    Jena, IRI, ARQ, LARQ, TDB, SDB, Fuseki
>> Joseki is not on the list. I imagine because Fuseki replaces it. I
>> also imagine there will be no Apache release for Joseki ever.
>> Correct?
> Maybe - it's about timing.  Fuseki with configuration files = Joseki4.
>>> The current build system is one maven project per subsystem. Each
>>> subsystem produces a single download zip file and also deploys
>>> artifacts.
>>> The builds are linked by version dependencies in the POM files.  There
>>> is a hack to get ARQ into the Jena download to break the circular
>>> dependency.
>> I have a simple test to decide how to break dependencies: can you use
>> X without Y?
>> Currently, you can use Jena without ARQ.
>> However, you cannot use ARQ without Jena.
>> Therefore, ARQ depends on Jena.
>> Rightly so, Jena's pom.xml file does not have a dependency on ARQ.
>> However, Jena distribution (i.e. the current .zip file) includes ARQ
>> since most of the time people want to run SPARQL queries as well.
>>> == Goals
>>> These are my take on desirable features, not necessarily absolute
>>> requirements, so if it isn't practical to achieve it in the overall
>>> system, then a goal can be modified or removed.
>>   + Creating an Apache Release
> I didn't put that because the message was about the maven structure, not
> the files in each bit etc.
> Does Apache Release force/suggest a particular maven layout?  I'd be
> surprised if it did.

No, it does not.

>>> + Balance cost of change and benefit (we don't have to start with a
>>> clean sheet - we can leave some things as they are, because they are).
>>> + A single download zip file  for using Jena as a library
>> I imagine this is not very different from the current Jena
>> distribution as a .zip file. Am I right?
>> I would expect to find a lib directory with all the dependent jars as
>> well as jena-x.y.z.jar, arq-x.y.z.jar, iri, etc.
> No. I'm floating the idea there is also one jar in addition to the maven
> artifacts.


It might be easier/better to have a separate module (called jena-dist|
jena-all|jena-one-jar) just to do that. You do something similar in
Fuseki, right?
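
As a rough sketch of what such a module could contain, here is a
maven-assembly-plugin descriptor that unpacks the Jena jars and merges them
into one jar. The `all` id, the file location and the list of included
artifacts are assumptions for illustration (whether TDB belongs in the list
is still an open question); the coordinates are the ones we currently
publish under com.hp.hpl.jena.

```xml
<!-- Hypothetical descriptor, e.g. src/main/assembly/all.xml in a jena-all module -->
<assembly>
  <id>all</id>
  <formats>
    <format>jar</format>
  </formats>
  <includeBaseDirectory>false</includeBaseDirectory>
  <dependencySets>
    <dependencySet>
      <!-- Unpack each jar so the classes are merged into a single jar -->
      <unpack>true</unpack>
      <!-- The packaging module has no code of its own -->
      <useProjectArtifact>false</useProjectArtifact>
      <!-- Only Jena's own code, not third-party runtime dependencies -->
      <includes>
        <include>com.hp.hpl.jena:jena</include>
        <include>com.hp.hpl.jena:iri</include>
        <include>com.hp.hpl.jena:arq</include>
        <include>com.hp.hpl.jena:tdb</include>
      </includes>
    </dependencySet>
  </dependencySets>
</assembly>
```

The packaging module would list those four artifacts as ordinary
dependencies, so the individual jars remain available to anyone who wants
to mix and match.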

>> Would that include TDB jar as well?
> "that" = lib/ ?
> No. One jar.

I was asking if the Jena "one jar" includes TDB|SDB classes.

>> Would that include SDB jar as well?
> No. This is the set of things you might want for "normal" use as a
> library in an application.  Adding TDB, now it's got transactions, seems
> to give a useful package of functionality.  SDB would be separate - you
> have to config SQL DBs to use it.


> It's not expert use or fine tuning.
>> Does this imply there is no need for a .zip distribution of ARQ, SDB
>> or TDB?
> Correct.


>> A single download zip file is good: less confusion for people, less work
>> for us (i.e. we just manage a single zip file).
>>> + A single jar file for using Jena as a library
>> Could you be more precise on what this jar would include?
> <assembly>
>   <format>jar
>   <dependencySets>
>     <dependencySet>
>       <includes>jena-core.jar, arq.jar, tdb.jar etc etc
>> Does it include all the necessary runtime dependencies or is it just
>> code we write?
> Just Jena.
>> Does it include ARQ, SDB and TDB?
>> I am not sure who this single jar file is targeted at.
>> Expert developers/users would probably not like to have a single jar
>> if that includes all the runtime dependencies as well.
>> Expert developers/users would probably not use that single jar,
>> since sometimes they want to use or test a patched version of just one
>> of the components (i.e. ARQ, IRI, SDB, TDB, etc.)
>> New users would probably download the .zip distribution to use Jena
>> for their first time.
> Or use from maven.
> Add one jar to the classpath, all the right versions checked and merged.

With one jar it will be impossible to make mistakes about versions, I agree.

>> If we think Jena as a library, we should focus on modularity and ease
>> of use with tools such as Ivy, Maven, etc. and document this well as well
>> as provide simple examples to start with (as we are trying
>> to do already).
>> Often, I just want to parse a simple Turtle file and I would find it
>> annoying to include a ~20MB jar file just to do that.
>>   "When we started working on Any23, the Sesame library was more
>>    modularized and documented and there was also a full Maven support.
>>    Today much of these reasons are no longer valid."
>>    (from mailing list)
>> I can relate to that sort of comment.
>> Now, Jena offers good|full(?) Maven support.
>> However, I would argue we are still lacking in terms of modularization.
>> If someone does not need an inference engine, or support for RDF/XML
>> parsing, or OWL APIs, it would be good if he/she could use Jena
>> without those parts.
>> Even more so for people wanting to run Jena on "constrained"
>> environments, just to make an example: Android.
> No reason you can't go get the pieces using maven.  It's not either-or
> -- it's as-well-as.


> One jar has been good in Fuseki.

Yes, I agree (so long as you don't need to patch any of the things Fuseki
is made of), even if Fuseki is slightly different from Jena: Jena is mostly
a library, Fuseki is a self-contained application you download and run.

>> A single jar file seems to me going against some of these things.
>> So, I'd like to understand more why you propose a "single jar file for
>> using Jena as a library".
>>> The main download should be complete - everything you need to write a
>>> Jena application using Jena as a library.  I'd like to change to having
>>> a single zip that is current Jena + ARQ + LARQ + TDB (maybe?).
>> I think this is a good idea.
>> I would also add SDB to the single zip file. Why not?
>> And, I would remove zip distribution files from ARQ, LARQ (which
>> hasn't got one at the moment), SDB and TDB.
>>> That puts datasets and quads into Jena core.
>> +1
>>> (Some API changes could also happen to make this feel more integrated.)
>>> It also seems easier to deliver a single jar for this.
>> See above.
>>> And a single obviously-named Maven artifact - at the moment, a single
>>> dependency to pull is e.g. TDB because that pulls in the rest, which
>>> isn't exactly obvious.
>> I am not sure I follow you here. Probably, because I don't understand
>> what you exactly mean with "a single obviously-named Maven artifact".
>> If someone wants to use TDB, the obvious dependency to pull into their
>> project is TDB (which depends on ARQ, which depends on Jena), letting
>> Maven, or any other tool which can download artifacts from a repository,
>> resolve the rest of the dependencies (with the right version
>> numbers). Currently we have:
>> [INFO] com.hp.hpl.jena:tdb:jar:0.8.11-SNAPSHOT
>> [INFO] +- com.hp.hpl.jena:arq:jar:2.8.9-SNAPSHOT:compile
>> [INFO] |  +- org.codehaus.woodstox:wstx-asl:jar:3.2.9:compile
>> [INFO] |  |  \- stax:stax-api:jar:1.0.1:compile
>> [INFO] |  +- org.apache.lucene:lucene-core:jar:2.3.1:compile
>> [INFO] |  +- org.apache.httpcomponents:httpclient:jar:4.1.2:compile
>> [INFO] |  |  \- commons-codec:commons-codec:jar:1.4:compile
>> [INFO] |  \- org.apache.httpcomponents:httpcore:jar:4.1.2:compile
>> [INFO] +- com.hp.hpl.jena:arq:jar:tests:2.8.9-SNAPSHOT:test
>> [INFO] +- com.hp.hpl.jena:jena:jar:2.6.4:compile
>> [INFO] |  +-
>> [INFO] |  \- xerces:xercesImpl:jar:2.7.1:compile
>> [INFO] +- com.hp.hpl.jena:jena:test-jar:tests:2.6.4:test
>> [INFO] +- com.hp.hpl.jena:iri:jar:0.8:compile
>> [INFO] +- junit:junit:jar:4.8.2:test
>> [INFO] +- org.slf4j:slf4j-api:jar:1.6.1:compile
>> [INFO] +- org.slf4j:slf4j-log4j12:jar:1.6.1:compile
>> [INFO] \- log4j:log4j:jar:1.2.16:compile
>> Is something like this what you are proposing? :
>> [INFO] com.hp.hpl.jena:jena-all:jar:x.y.z
>> [INFO] +- org.codehaus.woodstox:wstx-asl:jar:3.2.9:compile
>> [INFO] |  \- stax:stax-api:jar:1.0.1:compile
>> [INFO] +- org.apache.lucene:lucene-core:jar:2.3.1:compile
>> [INFO] +- org.apache.httpcomponents:httpclient:jar:4.1.2:compile
>> [INFO] |  \- commons-codec:commons-codec:jar:1.4:compile
>> [INFO] +- org.apache.httpcomponents:httpcore:jar:4.1.2:compile
>> [INFO] +-
>> [INFO] +- xerces:xercesImpl:jar:2.7.1:compile
>> [INFO] +- junit:junit:jar:4.8.2:test
>> [INFO] +- org.slf4j:slf4j-api:jar:1.6.1:compile
>> [INFO] +- org.slf4j:slf4j-log4j12:jar:1.6.1:compile
>> [INFO] \- log4j:log4j:jar:1.2.16:compile
>> A jena-all-x.y.z.jar uber jar?
> Don't think "TDB" - think "Jena"
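
For reference, this is all a project has to declare today to get the first
tree above; Maven resolves ARQ, Jena, IRI and the rest transitively. The
coordinates and version are copied from that tree. A jena-all artifact
would presumably be used the same way with a different artifactId, but no
such artifact exists yet.

```xml
<dependency>
  <groupId>com.hp.hpl.jena</groupId>
  <artifactId>tdb</artifactId>
  <version>0.8.11-SNAPSHOT</version>
</dependency>
```
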
>>> Other goals or considerations?
>> As I've said above: creating Apache releases should be IMHO the
>> number one priority.
> Does this affect the maven layout?


Maven is an Apache project and a lot of Apache projects use Maven with
the default|recommended Maven layout (we are not that far from that,
even if some of the Jena modules are more 'creative' than others in this

While Apache does not enforce projects to use Maven, it does not
discourage it either. Indeed, it seems to me it is encouraged to a
certain degree:

From a point of view of users and developers it's a big plus, IMHO. They
come to a project and they already know how to compile, test, package it,
etc. They already know where to look for stuff and how a module is

From a point of view of the release manager and/or whoever needs to take care
of the building of various modules, it guarantees a certain degree of
uniformity and coherence (which is not bad at all).

>> As well as "release early, release often", at least at the beginning
>> while we get used to the Apache processes and how to create Apache
>> releases. This is also one of the objectives of the incubation
>> phase and, more importantly, a requirement for graduation.
>> How can we demonstrate our ability to create Apache releases if we
>> release every 6 or 12 months?
>> Or, how long do we expect to be in "incubation"? ;-)
>> Initially, while we get used to the Apache way to cut releases, we
>> should release more often and not be afraid of quickly fixing problems
>> or apply improvements to the release process using minor
>> version numbers _._.x. Nowadays, if a new jar is a drop-in replacement
>> people are willing to (and can easily) upgrade.
>>> == Possible build layout.
>>> Divide the overall project into a number of maven modules for building
>>> parts of the system and a number of projects for making deliverables.
>>> Just the code modules would mean you can work from a set of many jars
>>> and mix and match for development etc.
>> For others, we went down this route already... without using Maven (i.e.
>> using Ant + Ivy, and it has been painful). The result is here:
>> It was a good experience to see what's necessary to separate a "core"
>> from RDF/XML parsing, etc. and to think about what a minimal system
>> would include.
>>> Application writers can get a single consolidated jar.
>> Here, again, I don't understand who you mean with "application
>> writers" (is it us? are other companies using Jena? are University
>> students? are other Apache committers?) and why they would benefit from
>> a "single consolidated jar".
>>> Jena-top-POM -- common declarations, a lot of properties getting set.
>> +1
>> I think having a Jena specific parent pom.xml file is a good idea (and a
>> best practice).
>> We usually do this for our internal projects @ Talis.
>> Large organizations have a corporate parent pom as well (which for us
>> should be this:
>>> Code modules:
>>>    JenaSys
>>>      -- This is the current Jena2.
>>>        How much do we want to split it up?
>>>        Is it worth the effort?
>>>          core = graph + datatypes
>> +1 (== I think it is a good idea and I am prepared to help here) on a
>> small/minimal Jena core module.
>> I often need just this (+ RIOT below) and I imagine it would make a
>> lot of people and projects (such as Any23, just to make an example)
>> happy.
>> Another use case: launching MapReduce jobs with ~20MB jar files is a
>> bit of a pain, you often need just core + RIOT (to parse
>> N-Triples|N-Quads files) there.
>>>          RDF API (inc enhanced?)
>>>          owlapi
>>>          rules
>>>          Assembler? Here? Module of its own?
>>>    IRI
>>>    Atlas -- Non RDF specific stuff.
>> +1 (== ditto) on separating out Atlas from ARQ.
>>>    RIOT
>>>      -- ideally Jena-core+RIOT is a useful set
>> +1
>>>      -- Move ARP and XML output here or separate module again.
>> As a separate module, it changes much less often than RIOT.
>> Another use case: apparently you cannot run with Xerces on Android.
>> This caused problems to people wanting to use Jena on Android.
>>>    ARQ
>>>     -- minus atlas, and RIOT
>> +1
>>>    TDB -- transactional
>>>    SDB
>>>    RDB? Legacy or remove once and for all.
>> +1 on having RDB as a deprecated separate module.
>>>    Documentation = website only.
>> +1
>> The disadvantage is that on the website you only have the most recent
>> documentation, not the one corresponding to the (maybe obsolete)
>> version of Jena you might be using.
>> However, since Jena is quite stable now, I don't think this will be a
>> problem (and we can always revisit/change this in future).
>>> Deliver modules:
>>>    Jena  -- the deliverable: one jena-the-jar and zip file.
>>>    JenaCmd -- Command line things: Jena+ARQ+TDB commands
>> Maven artifacts, IMHO, should be included in the list of
>> "deliverables" of a Jena release (although only what we will be
>> putting here has 'legal' value in
>> Apache).
> Clearly - a maven module produces maven artifacts.  Each of jena, arq,
> tdb etc etc still produces and deploys its own jar.
> It gets repacked *as well* into convenient forms.


>> Rationale to consider also Maven artifacts as first-class deliverables
>> of a Jena release is: Jena is 'mostly' a library which people use to
>> write applications and modern building tools/systems (such
>> as Maven, but not only that) have dependency engines to transitively
>> resolve dependencies as well as on-line repositories where developers
>> can easily find artifacts (including sources and test
>> packages). Once you have that, you rarely manually download a .zip or
>> .tar.gz as developer.
>> ... and we all felt the pain of failing to find an artifact of a
>> library we want to use.
>>> Fuseki is a separate module and deliverable.  It uses combined Jena as a
>>> dependency but does not need to be part of the library build.
>> I agree.
>> Fuseki is something an end-user wants to: download, unzip, (load data)
>> and run.
>>> Eyeball is a separate module and deliverable.  It uses combined Jena as
>>> a dependency but does not need to be part of the library build.
>> I mostly agree.
>> I've never used Eyeball much, however someone might want to
>> include/use Eyeball in their application (with additional/custom
>> extensions/checks).
>> For this reason, I would argue it would not be a bad idea to have Eyeball
>> artifacts (i.e. an eyeball-x.y.z.jar) published as Maven artifact on
>> the Apache Maven Repository.
>>> === Questions and notes.
>>> 1/ We currently make some attempt to deliver the test suite in the zip
>>> so people can locally run it to check an installation.  From memory, the
>>> only thing this seems to catch is problems running the test suite, not
>>> problems with installation.  Maybe it's not worth the effort.
>> +1 on removing
>> Rationale: if you want to run the test suite you should be able to
>> checkout a tagged source tree and type mvn test. For example:
>>    svn co
>> arq
>>    cd arq
>>    mvn test
>> This is a much better way to let people run the test suite on their
>> system (i.e. different OS, different JVM, etc.)
>> I do agree that it's not exactly the same as running the test
>> suite against arq-x.y.z.jar, but how many other Apache projects do you
>> know who are doing this? ;-)
>> However, it is sometimes useful to publish the test suite as Maven
>> artifact. This way people can specify a dependency on that and reuse
>> tests or utilities we have in our test suites elsewhere. This is
>> the reason why, for example, we have
>> (as well as:
>> I consider this a good practice and, if possible, I'd like to keep it.
>> The ideal situation (and best practice) would be to have the files
>> necessary to run the test suite included in that jar (i.e.
>> arq-x.y.z-tests.jar). Maven has support for that, but people need to use
>> getSystemResourceAsStream() to read test files (as I am sure you
>> know). At development time, those files must be in src/test/resources
>> (for example, LARQ does this:
>> This would be my favorite option, but it requires some changes.
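
A sketch of how a module can publish its tests as a separate artifact with
the maven-jar-plugin test-jar goal, and how another project would then
depend on it. The arq coordinates, the tests classifier and the version
come from the dependency tree earlier in this message; the plugin version
is left out here and would need pinning in a real pom.

```xml
<!-- In the module that owns the tests: also build artifactId-version-tests.jar -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-jar-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>test-jar</goal>
      </goals>
    </execution>
  </executions>
</plugin>

<!-- In a project that wants to reuse those tests or test utilities -->
<dependency>
  <groupId>com.hp.hpl.jena</groupId>
  <artifactId>arq</artifactId>
  <version>2.8.9-SNAPSHOT</version>
  <classifier>tests</classifier>
  <scope>test</scope>
</dependency>
```

Files kept under src/test/resources are packaged into the tests jar along
with the compiled test classes, which is why the tests have to read them
from the classpath rather than from a file path.
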
>>> 2/ The Apache top level POM has a list of versioned plugins in it which
>>> we'd inherit.  Hopefully it helps with an Apache release but it does seem
>>> quite a lot.  The default compilation is Java 1.4 -- we need to check
>>> details.
>> LARQ pom.xml file, for example, has this:
>>    <parent>
>>      <groupId>org.apache</groupId>
>>      <artifactId>apache</artifactId>
>>      <version>9</version>
>>    </parent>
>> However, it specifies Java 1.6 for compiling:
>>        <plugin>
>>          <groupId>org.apache.maven.plugins</groupId>
>>          <artifactId>maven-compiler-plugin</artifactId>
>>          <configuration>
>>            <source>${jdk.version}</source>
>>            <target>${jdk.version}</target>
>>            <encoding>${}</encoding>
>>          </configuration>
>>        </plugin>
>> You can verify the effective pom.xml file using: mvn help:effective-pom
> My point was we need to be careful.
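
One way to be careful is to set the Java level once, in a Jena parent pom
that itself inherits from org.apache:apache:9. A minimal sketch: the
org.apache.jena groupId, the jena-parent artifactId and the version are
hypothetical; jdk.version is the property name LARQ already uses, and 1.6
is the level LARQ compiles with.

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>

  <parent>
    <groupId>org.apache</groupId>
    <artifactId>apache</artifactId>
    <version>9</version>
  </parent>

  <!-- Hypothetical coordinates for a Jena-wide parent -->
  <groupId>org.apache.jena</groupId>
  <artifactId>jena-parent</artifactId>
  <version>0-SNAPSHOT</version>
  <packaging>pom</packaging>

  <properties>
    <!-- Single place to set the Java level used by the compiler plugin below -->
    <jdk.version>1.6</jdk.version>
  </properties>

  <build>
    <pluginManagement>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-compiler-plugin</artifactId>
          <configuration>
            <source>${jdk.version}</source>
            <target>${jdk.version}</target>
          </configuration>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>
</project>
```

Running mvn help:effective-pom in any child module then shows whether the
Apache default or our setting wins.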


>> So, technically, the fact that Apache parent pom.xml has Java 1.4 as
>> default compilation isn't an issue.
>> I've not found problems with it, so far. This does not mean there
>> aren't any... but we should be able to override any behavior we don't
>> like, and we control if/when we upgrade from one version to another.
>> I think we will be better off in having the org.apache:apache:9 as
>> parent pom (directly or via our own parent pom), as suggested here:
>>> 3/ For RDB, I propose creating a maven module and putting the code here
>>> with a dependency of whatever version of Jena it is at the time then
>>> leaving it frozen.  Alternatively, zip up the code and dump somewhere in
>>> case anyone wants to port it.
>> +1 on having RDB as separate module (depending on Jena).
>>> 4/ Shall we leave the documentation out of the build and just have it on
>>> the website?
>> What about javadocs?
> All maven artifacts should have javadocs and source available.
> I really don't understand projects that don't put -sources up as well.

Laziness and ignoring the needs of developers (as well as not using Maven
because it sucks, but other options are not as friendly when it comes to
publishing artifacts (with -sources)).

Plenty of examples around.

>  But then, I strongly prefer to attach the sources to the javadocs.

Me too.

Sources are a must.

Javadocs optional. What's not clear to me is if you are proposing to stop
publishing javadocs as Maven artifacts.
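
For the record, attaching both is a few lines per module (or once in a
parent pom). This is a sketch only: the execution ids are arbitrary names
and the plugin versions are omitted.

```xml
<!-- Builds artifactId-version-sources.jar alongside the main jar -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-source-plugin</artifactId>
  <executions>
    <execution>
      <id>attach-sources</id>
      <goals>
        <goal>jar</goal>
      </goals>
    </execution>
  </executions>
</plugin>

<!-- Builds artifactId-version-javadoc.jar alongside the main jar -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-javadoc-plugin</artifactId>
  <executions>
    <execution>
      <id>attach-javadocs</id>
      <goals>
        <goal>jar</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```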

One thing I find annoying is that when you generate Eclipse files via
mvn eclipse:eclipse, even if you ask it to exclude Javadocs, the .classpath
will still point at the javadoc artifacts if they are in your Maven cache.
Even worse, the path is absolute, not relative to your Maven repository
(i.e. the configuration for Javadocs in the resulting .classpath is not
portable from one machine to another).

>>> 5/ Jump to maven 3?
>> Not sure why you are asking this.
>> I am still using Maven v2.x.y on my desktop (without problems) but we
>> are using Maven v3.0.3 with some of our modules on Jenkins
>> currently (let's cross
>> fingers) with no problems (and it should be more stable in the future).
> """
> While Maven 3 aims to be backward-compatible with Maven 2.x to the
> extent possible, there are still a few significant changes.
> """


So far, I did not have big problems with Maven 3 and our current pom.xml files.
Is there a particular feature we want to use which comes with Maven 3 only?

I would prefer to keep our pom.xml files within the subset common to Maven 2 and 3
so that we just require Maven >= 2 to compile, test, package from developers
and potential future committers who might need to test their patches locally.

Looking here:

These are relevant changes:

 - Automatic Plugin Version Resolution (=> we should specify the version
   of the plugins we use)
 - Dependency Resolution (this does not seem a big improvement, for us...
   indeed the fact that mvn dependency:tree does not show the real classpath
   is a big disadvantage in terms of usability, IMHO)
 - Stricter POM Validation (we see these in Jenkins where we can easily set
   which version of Maven to use)
 - Parent POM Resolution (this could impact us... or not)
 - ...
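
The first point, and the "Maven >= 2" requirement, can both be written
down in the pom itself. A sketch, assuming we use the
maven-enforcer-plugin for the second part; the plugin versions and the
lower bound of the range are placeholders I have not checked against our
builds.

```xml
<build>
  <pluginManagement>
    <plugins>
      <!-- Pin versions explicitly so Maven 2 and Maven 3 pick the same plugin -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>2.3.2</version>
      </plugin>
    </plugins>
  </pluginManagement>

  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-enforcer-plugin</artifactId>
      <version>1.0</version>
      <executions>
        <execution>
          <id>enforce-maven-version</id>
          <goals>
            <goal>enforce</goal>
          </goals>
          <configuration>
            <rules>
              <!-- Fail the build early on a Maven older than the range allows -->
              <requireMavenVersion>
                <version>[2.0,)</version>
              </requireMavenVersion>
            </rules>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```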

We should also check this:



>> Paolo
