Sure np, took me a while to get around to writing it too ;-)


On Oct 6, 2008, at 10:24 PM, Jason Warner wrote:

Just got around to reading this.  Thanks for the brain dump, Jason.  No questions as of yet, but I'm sure I'll need a few more reads before I understand it all. 

On Thu, Oct 2, 2008 at 2:34 PM, Jason Dillon wrote:
On Oct 1, 2008, at 11:20 PM, Jason Warner wrote:

Is the GBuild stuff in svn the same as the anthill-based code or is that something different?  GBuild seems to have scripts for running tck and that leads me to think they're the same thing, but I see no mention of anthill in the code.

The Anthill stuff is completely different than the GBuild stuff.  I started out trying to get the TCK automated using GBuild, but decided that the system lacked too many features to perform as I desired, and went ahead with Anthill as it did pretty much everything, though had some stability problems.

One of the main reasons why I chose Anthill (AHP, Anthill Pro that is) was its build agent and code repository systems.  These allowed me to ensure that each build used exactly the desired artifacts.  Another was the configurable workflow, which allowed me to create a custom chain of events to handle running builds on remote agents and control what data gets sent to them, what it will collect, and what logic to execute once all distributed work has been completed for a particular build.  And the kicker, which helped bring it all together, was its concept of a build life.

At the time I could find *no other* build tool which could meet all of these needs, and so I went with AHP instead of spending months building/testing features in GBuild.

While AHP supports configuring a lot of stuff via its web-interface, I found that it was very cumbersome, so I opted to write some glue, which was stored in svn here:

It's been a while, so I have to refresh my memory on how this stuff actually worked.  First let me explain about the code repository (what it calls Codestation) and why it was critical to the TCK testing IMO.  When we use Maven normally, it pulls data from a set of external repositories, picks up more repositories from the stuff it downloads, and quickly we lose control of where stuff comes from.  After it pulls down all that stuff, it churns through a build and spits out the stuff we care about, normally stuffing it (via mvn install) into the local repository.

AHP supports by default tasks to publish artifacts (really just a set of files controlled by an Ant-like include/exclude path) from a build agent into Codestation, as well as tasks to resolve artifacts (i.e. download them from Codestation to the local working directory on the build agent's system).  Each top-level build in AHP gets assigned a new (empty) build life.  Artifacts are always published to/resolved from a build life, either that of the current build, or of a dependency build.

So what I did was set up builds for Geronimo Server (the normal server/trunk stuff), which did the normal mvn install thingy, but I always gave it a custom -Dmaven.repo.local which resolved to something inside the working directory for the running build.  The build was still online, so it pulled down a bunch of stuff into an empty local repository (so it was a clean build wrt the repository, as well as the source code, which was always fetched for each new build).  Once the build had finished, I used the artifact publisher task to push *all* of the stuff in the local repository into Codestation, labeled as something like "Maven repository artifacts" for the current build life.

Then I set up another build for Apache Geronimo CTS Server (the porting/branches/* stuff).  This build was dependent upon the "Maven repository artifacts" of the Geronimo Server build, and I configured those artifacts to get installed on the build agent's system in the same directory that I configured the CTS Server build to use for its local Maven repository.  So again the repo started out empty, then got populated with all of the outputs from the normal G build, and then the cts-server build was started.  The build of the components and assemblies is normally fairly quick and, aside from some stuff in the private tck repo, won't download much more stuff, because it already had most of its dependencies installed via the Codestation dependency resolution.  Once the build finished, I published the cts-server assembly artifacts back to Codestation under something like "CTS Server Assemblies".

Up until this point it's normal builds, but now we have built the G server, then built the CTS server (using the *exact* artifacts from the G server build, even though each might have happened on a different build agent).  And now we need to go and run a bunch of tests, using the *exact* CTS server assemblies, produce some output, collect it, and once all of the tests are done render some nice reports, etc.

AHP supports setting up builds which contain "parallel" tasks; each of those tasks is then performed by a build agent.  They have fancy build agent selection stuff, but for my needs I had basically 2 groups: one for running the server builds, and another for running the tests.  I only set aside like 2 agents for builds and the rest for tests.  Oh, I forgot to mention that I had 2 16x 16g AMD beasts all running CentOS 5, each with about 10-12 Xen virtual machines running internally to run build agents.  Each system also had a RAID-0 array set up over 4 disks to help reduce disk I/O wait, which, as I found out, was the limiting factor when trying to run a ton of builds that all check out and download artifacts and such.

I helped the AHP team add a new feature, a parallel iterator task: you define *one* task that internally fires off n parallel tasks, each of which sets the iteration number and leaves it up to the build logic to pick what to do based on that index.  The alternative was an unwieldy set of like 200 tasks in their UI, which simply didn't work at all.  You might have noticed an "iterations.xml" file in the tck-testsuite directory; this is what was used to take an iteration number and turn it into what tests we actually run.  The <iteration> bits are order sensitive in that file.

Soooo, after we have a CTS Server for a particular G Server build, we can now go and do "runtests" for a specific set of tests (defined by an iteration)... this differed from the other builds above a little, but still pulled down artifacts: the CTS Server assemblies (only the assemblies and the required bits to run the geronimo-maven-plugin, which was used to geronimo:install, as well as used by the tck itself to fire up the server and so on).  The key thing here, with regards to the Maven configuration (besides using that custom Codestation-populated repository), was that the builds were run *offline*.
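
The two styles of Maven invocation described so far (an online build into an empty per-build repository, and an offline run against a repository populated from Codestation) can be sketched roughly as below.  This is only an illustration in Python: the "repository" directory name and the function name are assumptions, while maven.repo.local and --offline are standard Maven options.

```python
def maven_command(workdir, goals, offline=False):
    """Build an mvn argument list that keeps the local repository inside workdir."""
    # "repository" is a hypothetical directory name inside the build's
    # working directory; the real layout is not given in this mail.
    cmd = ["mvn", f"-Dmaven.repo.local={workdir}/repository"]
    if offline:
        # Offline mode makes Maven fail rather than download anything
        # that was not already placed in the local repository.
        cmd.append("--offline")
    cmd.extend(goals)
    return cmd
```

The server builds would correspond to maven_command(workdir, ["install"]), and the runtests iterations to the same call with offline=True, so a missing artifact shows up as a build failure instead of a silent download.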

After runtests completed, the results were then soaked up (the stuff that javatest pukes out with icky details, as well as the full log files and other stuff I can't recall) and then pushed back into Codestation.

Once all of the iterations were finished, another task fires off which generates a report.  It downloads from Codestation all of the runtests outputs (each was zipped I think), unzips them one by one, runs some custom goo I wrote (based on some of the concepts from the original GBuild-based TCK automation), and generates a nice Javadoc-like report that includes all of the gory details.
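
The unzip-one-by-one step might look something like the following Python sketch.  The directory arguments and the one-archive-per-iteration naming are assumptions; the report rendering itself is left out because its input format is not described here.

```python
import zipfile
from pathlib import Path


def unpack_results(archive_dir, output_dir):
    """Unzip each results archive into its own subdirectory.

    Returns the archive names (without .zip) in the order processed.
    """
    processed = []
    # Sort so the report input order does not depend on directory listing order.
    for archive in sorted(Path(archive_dir).glob("*.zip")):
        target = Path(output_dir) / archive.stem
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        processed.append(archive.stem)
    return processed
```

Keeping each iteration's output in a separate directory means a report generator can attribute every log and result file to the iteration that produced it.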

I can't remember how long I spent working on this... too long (not the reports I mean, the whole system).  But in the end I recall something like running an entire TCK testsuite for a single server configuration (like jetty) in about 4-6 hours... I sent mail to the list with the results, so if you are curious what the real number is, instead of my guess, you can look for it there.  But anyway it was damn quick running on just those 2 machines.  And I *knew* exactly that each of the distributed tests was actually testing a known build that I could trace back to its artifacts and then back to its SVN revision, without worrying about mvn downloading something new when midnight rolled over or that a new G server or CTS server build that might be in progress hasn't compromised the testing by polluting the local repository.

 * * *

So, about the sandbox/build-support stuff...

First there is the 'harness' project, which is rather small, but contains the basic stuff: a version of Ant and Maven which all of these builds would use, some other internal glue, and a fix for an evil Maven problem causing erroneous build failures due to some internal thread state corruption or gremlins, not sure which.  I kinda used this project to help manage the software needed by normal builds, which is why Ant and Maven were in there... i.e. so I didn't have to go install it on each agent each time it changed, just let the AHP system deal with it for me.

This was set up as a normal AHP project, built using its internal Ant builder (though that builder was still configured to use the local version it pulled from SVN, to ensure it always works).

Each other build was set up to depend on the output artifacts from the build harness build, using the latest in a range, like say using "3.*" for the latest 3.x build (which looks like that was 3.7).  This let me work on new stuff w/o breaking the current builds as I hacked things up.
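
For illustration, picking "the latest in a range" can be sketched as below.  This is a Python approximation under the assumption that versions are dotted integers and that a pattern like "3.*" matches on its leading components; AHP's actual matching rules are not described in this mail.

```python
def latest_matching(versions, pattern):
    """Return the highest version matching a pattern such as '3.*', or None."""
    parts = pattern.split(".")
    if parts[-1] != "*":
        raise ValueError("pattern must end in '*', e.g. '3.*'")
    prefix = [int(p) for p in parts[:-1]]
    candidates = []
    for version in versions:
        numbers = [int(p) for p in version.split(".")]
        if numbers[:len(prefix)] == prefix:
            candidates.append(numbers)
    if not candidates:
        return None
    # Compare as lists of integers so that 3.10 sorts after 3.7.
    return ".".join(str(n) for n in max(candidates))
```

With harness builds 2.9, 3.2, 3.7 and 4.0 available, a dependency on "3.*" resolves to 3.7, while work on a 4.x harness stays invisible to the existing builds.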

So, in addition to all of the stuff I mentioned above wrt the G and CTS builds, each also had this step which resolved the build harness artifacts to that working directory, and the Maven builds were always run via the version of Maven included from the harness.  But, AHP didn't actually run that version of Maven directly, it used its internal Ant task to execute the version of Ant from the harness *and* use the harness.xml buildfile.

The harness.xml stuff is some more goo which I wrote to help manage AHP configurations.  With AHP (at that time, not sure if it has changed) you had to do most everything via the web UI, which sucked, and it was hard to refactor sets of projects and so on.  So I came up with a standard set of tasks to execute for a project, then put all of the custom muck I needed into what I called a _library_, and then had AHP, via harness.xml, invoke it with some configuration about what project it was and other build details.

The actual harness.xml is not very big.  It simply makes sure that */bin/* is executable (Codestation couldn't preserve execute bits), then uses the Codestation command-line client (invoking the Java class directly though) to ask the repository to resolve artifacts from the "Build Library" to the local repository.  I had this artifact resolution separate from the normal dependency (or harness) artifact resolution so that it was easier for me to fix problems with the library while a huge set of TCK iterations were still queued up to run.  Basically, if I noticed a problem due to a code or configuration issue in an early build, I could fix it, and use the existing builds to verify the fix, instead of wasting an hour (sometimes more depending on networking problems accessing remote repos while building the servers) to rebuild and start over.
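
The execute-bit fixup is simple enough to sketch.  The version below is Python rather than the Ant task harness.xml would have used, and it assumes the pattern means "every file directly inside any directory named bin", at any depth under the root.

```python
import os
import stat


def make_bin_executable(root):
    """Add execute bits to every file directly inside a directory named 'bin'.

    Returns the list of paths that were changed.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if os.path.basename(dirpath) != "bin":
            continue
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = os.stat(path).st_mode
            # Keep the existing permissions and add execute for user, group, other.
            os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
            changed.append(path)
    return changed
```

Running this right after artifacts are resolved restores what the artifact store dropped, so scripts such as the bundled Ant and Maven launchers can be executed directly.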

This brings us to the 'libraries' project.  In general the idea of a _library_ was just a named/versioned collection of files which could be used by a project.  The main (er, only) library defined in this SVN is system/.  This is the Groovy glue which made everything work.  This is where the entry-point class is located, the guy who gets invoked by harness.xml via:

   <target name="harness" depends="init">
               <pathelement location="${library.basedir}/groovy"/>


I won't go into too much detail on this stuff now; take a look at it and ask questions.  But, basically there is stuff in gbuild.system.* which is harness support muck, and stuff in gbuild.config.* which contains configuration.  I was kinda mid-refactoring of some things, starting to add new features, not sure where I left off actually.  But the key bits are in gbuild.config.project.*  This contains a package for each project, with the package name being the same as the AHP project (with " " -> "_").  And then in each of those packages is at least a Controller.groovy class (or other classes if special muck was needed, like for the report generation in Geronimo_CTS, etc).

The controller defines a set of actions, implemented as Groovy closures bound to properties of the Controller class.  One of the properties passed in from the AHP configuration (configured via the Web UI, passed to the harness.xml build, and then on to the Groovy harness) was the name of the _action_ to execute.  Most of that stuff should be fairly straightforward.
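
As a rough illustration of the naming rule and the action lookup, here is a Python sketch standing in for the Groovy classes.  The action names "build" and "runtests", the context keys and the function names are assumptions; only the " " -> "_" rule and the idea of actions stored as properties come from the description above.

```python
class Controller:
    """Stand-in for a per-project controller; actions are callables held as attributes."""

    def __init__(self):
        # Hypothetical actions; the real controllers define their own set.
        self.build = lambda ctx: f"building {ctx['project']}"
        self.runtests = lambda ctx: f"running iteration {ctx['iteration']}"


def package_name(project_name):
    """Map an AHP project name to its package name (spaces become underscores)."""
    return project_name.replace(" ", "_")


def run_action(controller, action_name, context):
    """Look up the named action on the controller and execute it."""
    if action_name.startswith("_"):
        raise LookupError(f"unknown action: {action_name}")
    action = getattr(controller, action_name, None)
    if not callable(action):
        raise LookupError(f"unknown action: {action_name}")
    return action(context)
```

The appeal of this shape is that the web UI only has to pass a project name and an action name; everything else lives in version-controlled code.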

So after a build is started (maybe from a Web UI click, or SVN change detection, or a TCK runtests iteration) the following happens (in simplified terms):

 * Agent starts build
 * Agent cleans its working directory
 * Agent downloads the build harness
 * Agent downloads any dependencies
 * Agent invokes Ant on harness.xml passing in some details
 * Harness.xml downloads the system/1 library
 * Harness.xml runs gbuild.system.BuildHarness
 * BuildHarness tries to construct a Controller instance for the project
 * BuildHarness tries to find Controller action to execute
 * BuildHarness executes the Controller action
 * Agent publishes output artifacts
 * Agent completes build

A few extra notes on libraries: the JavaEE TCK requires a bunch of stuff we get from Sun to execute.  This stuff isn't small, but is for the most part read-only.  So I set up a location on each build agent where these files were installed.  I created AHP projects to manage them and treated them like a special "library", one which tried really hard not to go fetch its content unless the local content was out of date.  This helped speed up the entire build process... because that delete/download of all that muck really slows down 20 agents running in parallel on 2 big machines with striped arrays.  For legal reasons this stuff was not kept in's main repository, and for logistical reasons wasn't kept in the private tck repo on either.  Because there were so many files, and because the httpd configuration on kicks out requests that it thinks are *bunk* to help save the resources for the community, I had set up a private SSL-secured svn repository on the old machines to put in the full muck required, then set up some goo in the harness to resolve them.  This goo is all in gbuild.system.library.*  See the gbuild.config.projects.Geronimo_CTS.Controller for more of how it was actually used.
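
The "don't fetch unless out of date" behaviour can be sketched like this.  It is a Python illustration only: the .version marker file and the fetch callback are assumptions, and the real logic lives in the Groovy code under gbuild.system.library.*.

```python
from pathlib import Path


def ensure_library(install_dir, wanted_version, fetch):
    """Fetch a library only when the installed copy is missing or stale.

    fetch is a callable that populates install_dir.  Returns True if it ran.
    """
    install_dir = Path(install_dir)
    marker = install_dir / ".version"  # hypothetical marker file
    if marker.is_file() and marker.read_text().strip() == wanted_version:
        return False  # local content is current, skip the expensive download
    install_dir.mkdir(parents=True, exist_ok=True)
    fetch(install_dir)
    # Record the version only after a successful fetch, so a failed
    # download is retried on the next build.
    marker.write_text(wanted_version + "\n")
    return True
```

With many agents sharing the same disks, skipping the download in the common case matters far more than the cost of reading one small marker file.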

 * * *

Okay, that is about all the brain-dump for TCK muck I have in me for tonight.  Reply with questions if you have any.



~Jason Warner