geronimo-dev mailing list archives

From Jason Dillon <jason.dil...@gmail.com>
Subject Re: Continuous TCK Testing
Date Thu, 02 Oct 2008 18:34:36 GMT
On Oct 1, 2008, at 11:20 PM, Jason Warner wrote:

> Is the GBuild stuff in svn the same as the anthill-based code or is  
> that something different?  GBuild seems to have scripts for running  
> tck and that leads me to think they're the same thing, but I see no  
> mention of anthill in the code.

The Anthill stuff is completely different from the GBuild stuff.  I
started out trying to get the TCK automated using GBuild, but decided
that the system lacked too many features to perform as I desired, and
went ahead with Anthill, as it did pretty much everything I needed,
though it had some stability problems.

One of the main reasons why I chose Anthill (AHP, Anthill Pro that
is) was its build agent and code repository systems.  These allowed me
to ensure that each build used exactly the desired artifacts.  Another
was the configurable workflow, which allowed me to create a custom
chain of events to handle running builds on remote agents and control
what data gets sent to them, what it will collect and what logic to
execute once all distributed work has been completed for a particular
build.  And the kicker which helped facilitate bringing it all
together was its concept of a build life.

At the time I could find *no other* build tool which could meet all of  
these needs, and so I went with AHP instead of spending months  
building/testing features in GBuild.

While AHP supports configuring a lot of stuff via its web interface, I
found that it was very cumbersome, so I opted to write some glue,
which was stored in svn here:

     https://svn.apache.org/viewvc/geronimo/sandbox/build-support/?pathrev=632245

It's been a while, so I had to refresh my memory on how this stuff
actually worked.  First let me explain about the code repository (what
it calls Codestation) and why it was critical to the TCK testing IMO.
When we use Maven normally, it pulls data from a set of external
repositories, picks up more repositories from the stuff it downloads,
and quickly we lose control over where stuff comes from.  After it
pulls down all that stuff, it churns through a build and spits out the
stuff we care about, normally stuffing it (via mvn install) into the
local repository.

AHP supports by default tasks to publish artifacts (really just a set
of files controlled by an Ant-like include/exclude path) from a build
agent into Codestation, as well as tasks to resolve artifacts (i.e.
download them from Codestation to the local working directory on the
build agent's system).  Each top-level build in AHP gets assigned a
new (empty) build life.  Artifacts are always published to/resolved
from a build life, either that of the current build, or of a
dependency build.

So what I did was set up builds for Geronimo Server (the normal
server/trunk stuff), which did the normal mvn install thingy, but I
always gave it a custom -Dmaven.repo.local which resolved to something
inside the working directory for the running build.  The build was
still online, so it pulled down a bunch of stuff into an empty local
repository (so it was a clean build wrt the repository, as well as the
source code, which was always fetched for each new build).  Once the
build had finished, I used the artifact publisher task to push *all*
of the stuff in the local repository into Codestation, labeled as
something like "Maven repository artifacts" for the current build
life.
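
For illustration, that invocation was shaped something like this (the
working-directory path here is hypothetical):

     mvn clean install -Dmaven.repo.local=/path/to/build/workdir/.repository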

Then I set up another build for Apache Geronimo CTS Server (the
porting/branches/* stuff).  This build was dependent upon the "Maven
repository artifacts" of the Geronimo Server build, and I configured
those artifacts to get installed on the build agent's system in the
same directory that I configured the CTS Server build to use for its
local maven repository.  So again the repo started out empty, then got
populated with all of the outputs from the normal G build, and then
the cts-server build was started.  The build of the components and
assemblies is normally fairly quick and, aside from some stuff in the
private tck repo, won't download much more stuff, because it already
had most of its dependencies installed via the Codestation dependency
resolution.  Once the build finished, I published the cts-server
assembly artifacts back to Codestation under something like "CTS
Server Assemblies".

Up until this point it's normal builds, but now we have built the G
server, then built the CTS server (using the *exact* artifacts from  
the G server build, even though each might have happened on a  
different build agent).  And now we need to go and run a bunch of  
tests, using the *exact* CTS server assemblies, produce some output,  
collect it, and once all of the tests are done render some nice  
reports, etc.

AHP supports setting up builds which contain "parallel" tasks, each of
those tasks then performed by a build agent.  They have fancy
build-agent selection stuff, but for my needs I had basically 2
groups: one group for running the server builds, and another for
running the tests.  I only set aside like 2 agents for builds and the
rest for tests.  Oh, I forgot to mention that I had 2 16-core, 16GB
AMD beasts, both running CentOS 5, each with about 10-12 Xen virtual
machines running internally to run build agents.  Each system also
had a RAID-0 array set up over 4 disks to help reduce disk I/O wait,
which, as I found out, was the limiting factor when trying to run a
ton of builds that all checkout and download artifacts and such.

I helped the AHP team add a new feature, a parallel iterator task: you
define *one* task that internally fires off n parallel tasks, each of
which sets the iteration number and leaves it up to the build logic to
pick what to do based on that index.  The alternative was an unwieldy
set of like 200 tasks in their UI, which simply didn't work at all.
You might have noticed an "iterations.xml" file in the tck-testsuite
directory; this was what was used to take an iteration number and turn
it into the tests we actually run.  The <iteration> bits are
order-sensitive in that file.
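
The idea was roughly this shape (element and attribute names here are
illustrative, not the actual file contents):

     <iterations>
         <!-- Order matters: the parallel iterator's index is used as
              a position in this list. -->
         <iteration name="jms" tests="com/sun/ts/tests/jms/**"/>
         <iteration name="jta" tests="com/sun/ts/tests/jta/**"/>
     </iterations>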

Soooo, after we have a CTS Server for a particular G Server build, we
can now go and do "runtests" for a specific set of tests (defined by
an iteration)... this differed from the other builds above a little,
but still pulled down artifacts: the CTS Server assemblies (only the
assemblies and the required bits to run the geronimo-maven-plugin,
which was used to geronimo:install, as well as used by the tck itself
to fire up the server and so on).  The key thing here, with regards
to the maven configuration (besides using that custom
Codestation-populated repository), was that the builds were run
*offline*.
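
Offline here just means Maven's -o/--offline switch, so nothing could
sneak in from a remote repository mid-run; e.g. (path hypothetical):

     mvn -o install -Dmaven.repo.local=/path/to/build/workdir/.repository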

After runtests completed, the results were then soaked up (the stuff
that javatest pukes out with icky details, as well as the full log
files and other stuff I can't recall) and then pushed back into
Codestation.

Once all of the iterations were finished, another task fired off which
generated a report.  It did this by downloading from Codestation all
of the runtests outputs (each was zipped I think), unzipping them one
by one, running some custom goo I wrote (based on some of the concepts
from the original GBuild-based TCK automation), and generating a nice
Javadoc-like report that included all of the gory details.
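
Conceptually, that aggregation step was something like this minimal
Groovy sketch (not the actual code; the directory layout and summary
handling are assumed):

     // Sketch only: unzip each iteration's result archive and tally.
     def ant = new AntBuilder()
     def totals = [passed: 0, failed: 0]
     new File('results').eachFileMatch(~/.*\.zip/) { zip ->
         def dest = new File('unpacked', zip.name - '.zip')
         ant.unzip(src: zip.absolutePath, dest: dest.absolutePath)
         // ... scan dest for javatest summaries, update totals ...
     }
     println "TCK results: ${totals.passed} passed, ${totals.failed} failed"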

I can't remember how long I spent working on this... too long (not the
reports I mean, the whole system).  But in the end I recall something
like running an entire TCK testsuite for a single server configuration
(like jetty) in about 4-6 hours... I sent mail to the list with the
results, so if you are curious what the real number is, instead of my
guess, you can look for it there.  But anyway, it was damn quick
running on just those 2 machines.  And I *knew* exactly that each of
the distributed tests was actually testing a known build that I could
trace back to its artifacts and then back to its SVN revision, without
worrying about mvn downloading something new when midnight rolled
over, or that a new G server or CTS server build in progress might
have compromised the testing by polluting the local repository.

  * * *

So, about the sandbox/build-support stuff...

First there is the 'harness' project, which is rather small, but
contains the basic stuff, like a version of Ant and Maven which all of
these builds would use, some other internal glue, and a fix for an
evil Maven problem causing erroneous build failures due to some
internal thread-state corruption or gremlins, not sure which.  I kinda
used this project to help manage the software needed by normal builds,
which is why Ant and Maven were in there... i.e. so I didn't have to
go install them on each agent each time they changed, just let the AHP
system deal with it for me.

This was set up as a normal AHP project, built using its internal Ant
builder (though that builder was still configured to use the local
version it pulled from SVN, to ensure it always worked).

Each other build was set up to depend on the output artifacts from the
build harness build, using the latest in a range, like say using "3.*"
for the latest 3.x build (which it looks like was 3.7).  This let me
work on new stuff w/o breaking the current builds as I hacked things
up.

So, in addition to all of the stuff I mentioned above wrt the G and
CTS builds, each also had this step which resolved the build harness
artifacts to that working directory, and the Maven builds were always
run via the version of Maven included from the harness.  But AHP
didn't actually run that version of Maven directly; it used its
internal Ant task to execute the version of Ant from the harness *and*
use the harness.xml buildfile.

The harness.xml stuff is some more goo which I wrote to help manage
AHP configurations.  With AHP (at that time, not sure if it has
changed) you had to do almost everything via the web UI, which sucked,
and it was hard to refactor sets of projects and so on.  So I came up
with a standard set of tasks to execute for a project, put all of the
custom muck I needed into what I called a _library_, and then had AHP,
via harness.xml, invoke it with some configuration about what project
it was and other build details.

The actual harness.xml is not very big; it simply makes sure that
*/bin/* is executable (Codestation couldn't preserve execute bits),
then uses the Codestation command-line client (invoking the Java class
directly though) to ask the repository to resolve artifacts from the
"Build Library" to the local repository.  I had this artifact
resolution separate from the normal dependency (or harness) artifact
resolution so that it was easier for me to fix problems with the
library while a huge set of TCK iterations were still queued up to
run.  Basically, if I noticed a problem due to a code or configuration
issue in an early build, I could fix it and use the existing builds to
verify the fix, instead of wasting an hour (sometimes more, depending
on networking problems accessing remote repos while building the
servers) to rebuild and start over.
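
The executable-bit fixup, for example, is just plain Ant, along these
lines (a sketch, not the exact buildfile contents):

     <chmod perm="ugo+x">
         <fileset dir="${basedir}">
             <include name="**/bin/**"/>
         </fileset>
     </chmod>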

This brings us to the 'libraries' project.  In general the idea of a
_library_ was just a named/versioned collection of files which could
be used by a project.  The main (er, only) library defined in this SVN
is system/.  This is the Groovy glue which made everything work.  This
is where the entry-point class is located (the guy who gets invoked
via harness.xml):

     <target name="harness" depends="init">
         <groovy>
             <classpath>
                 <pathelement location="${library.basedir}/groovy"/>
             </classpath>

             gbuild.system.BuildHarness.bootstrap(this)
         </groovy>
     </target>

I won't go into too much detail on this stuff now; take a look at it
and ask questions.  But basically there is stuff in gbuild.system.*
which is harness support muck, and stuff in gbuild.config.* which
contains configuration.  I was kinda mid-refactoring of some things,
starting to add new features, not sure where I left off actually.
But the key bits are in gbuild.config.project.*.  This contains a
package for each project, with the package name being the same as the
AHP project (with " " -> "_").  And then in each of those packages is
at least a Controller.groovy class (or other classes if special muck
was needed, like for the report generation in Geronimo_CTS, etc).

The controller defines a set of actions, implemented as Groovy  
closures bound to properties of the Controller class.  One of the  
properties passed in from the AHP configuration (configured via the  
Web UI, passed to the harness.xml build, and then on to the Groovy  
harness) was the name of the _action_ to execute.  Most of that stuff  
should be fairly straightforward.
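
Very roughly, a controller looked something like this (a minimal
sketch; the action bodies are illustrative, not the actual code):

     class Controller {
         // Each action is a Groovy closure bound to a property.
         def build = {
             println 'build the server using the harness-provided Maven'
         }
         def runtests = {
             println 'run one TCK iteration against the CTS assemblies'
         }
     }

     // The harness derives the package from the AHP project name
     // (" " -> "_"), constructs the Controller, and invokes the
     // action named by the AHP configuration:
     def action = 'runtests'
     new Controller()."$action"()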

So after a build is started (maybe from a Web UI click, or SVN change  
detection, or a TCK runtests iteration) the following happens (in  
simplified terms):

  * Agent starts build
  * Agent cleans its working directory
  * Agent downloads the build harness
  * Agent downloads any dependencies
  * Agent invokes Ant on harness.xml, passing in some details
  * Harness.xml downloads the system/1 library
  * Harness.xml runs gbuild.system.BuildHarness
  * BuildHarness tries to construct a Controller instance for the project
  * BuildHarness tries to find Controller action to execute
  * BuildHarness executes the Controller action
  * Agent publishes output artifacts
  * Agent completes build

A few extra notes on libraries: the JavaEE TCK requires a bunch of
stuff we get from Sun to execute.  This stuff isn't small, but is for
the most part read-only.  So I set up a location on each build agent
where these files were installed.  I created AHP projects to manage
them and treated them like a special "library", one which tried really
hard not to go fetch its content unless the local content was out of
date.  This helped speed up the entire build process... cause that
delete/download of all that muck really slows down 20 agents running
in parallel on 2 big machines with striped arrays.  For legal reasons
this stuff was not kept in svn.apache.org's main repository, and for
logistical reasons wasn't kept in the private tck repo on
svn.apache.org either.  Because there were so many files, and because
the httpd configuration on svn.apache.org kicks out requests that it
thinks are *bunk* to help save the resources for the community, I set
up a private SSL-secured svn repository on the old gbuild.org machines
to hold the full muck required, then set up some goo in the harness to
resolve them.  This goo is all in gbuild.system.library.*.  See
gbuild.config.projects.Geronimo_CTS.Controller for more of how it was
actually used.

  * * *

Okay, that is about all the brain-dump for TCK muck I have in me for  
tonight.  Reply with questions if you have any.

Cheers,

--jason


