I've been working for a while to describe an improved architecture for
Gump and I have decided to "go public" with the discussion because I
want this to be a community effort.
- o -
First and foremost, I believe that gump is one of the most exciting
things to have happened in the software space over the last few years,
but I also think that technical, architectural and social limitations
are stopping it from exhibiting its real potential.
The biggest problem I have is the fact that gump is such an integrated
system: it tries to do too much in one single stage.
Don't get me wrong: the internals of gump 2.x are rather modular and
well architected, but the overall system architecture is too monolithic.
So, here is my first suggestion: split gump into three stages.
1) metadata aggregation
2) build
3) build data use
- o -
Stage 1: Metadata aggregation
-----------------------------
Gump will socially scale only when the metadata about each project is
taken care of by the people that administer that project rather than by
a few gump meisters.
In this regard, I believe Maven to be far superior to ant in terms of
gump-friendliness because of its completely declarative nature (ant
builds are a functional language, from which project metadata cannot be
transparently inferred).
In a perfect world, all projects would *need* a metadata representation
of their structure so that a build tool can parse that and understand
what the project needs.
In the real world, there are two camps:
1) procedural: make,configure,sh,ant
2) declarative: maven,apt-get,ports
and the second normally builds on the first one.
The absolute need for gump (or apt-get, or BSD ports) is to have a
"declarative" layer on top of the "procedural" one for every project, a
'semantic' layer that the system can understand and work on.
Debian shows that it's possible to socially scale the concept of adding
a semantic layer on top of existing project efforts, in a completely
independent fashion.
Maven shows that it's possible for the projects themselves to make good
use of this information (also calling ant, if special needs are required).
For gump, what's important is that having maven generate gump
descriptors is both stupid and inefficient: gump should be able to
digest the maven POM directly, without requiring any effort from the
project.
We should be maintaining the metadata representation only for the projects
that don't have that data integrated in their build system (like pure
ant projects or make/configure projects).
So, what is a metadata aggregation layer?
It's a crawler for project metadata. It crawls projects and their
descriptors and aggregates them into a service that can be queried to
obtain that information.
In short:
[bunch of locations] --> crawler --> metadata database
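
To make the crawler idea concrete, here is a minimal sketch in Python.
Everything in it is an assumption made for illustration: the
ProjectMetadata record, the toy "name: dep1, dep2" descriptor format, the
table names and the use of SQLite are hypothetical and are not taken from
Gump or Maven. A real crawler would fetch descriptors from the locations
and would have one parser per descriptor format (a Maven POM, a gump
descriptor, and so on).

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class ProjectMetadata:
    """Hypothetical normalized record; not an actual Gump or Maven schema."""
    name: str
    source: str                                  # where the descriptor was found
    depends_on: list = field(default_factory=list)


def parse_simple(location, raw):
    """Parse a toy 'name: dep1, dep2' descriptor (illustration only)."""
    name, _, deps = raw.partition(":")
    return ProjectMetadata(name.strip(), location,
                           [d.strip() for d in deps.split(",") if d.strip()])


def crawl(locations, parsers):
    """Yield one record per location, using the parser registered for its format.

    `locations` maps a location to (format, raw descriptor text); a real
    crawler would fetch the text from the location instead.
    """
    for location, (fmt, raw) in locations.items():
        yield parsers[fmt](location, raw)


def aggregate(db, records):
    """Store the crawled records in the metadata database."""
    db.execute("CREATE TABLE IF NOT EXISTS project "
               "(name TEXT PRIMARY KEY, source TEXT)")
    db.execute("CREATE TABLE IF NOT EXISTS dependency "
               "(project TEXT, depends_on TEXT)")
    for rec in records:
        db.execute("INSERT OR REPLACE INTO project VALUES (?, ?)",
                   (rec.name, rec.source))
        db.execute("DELETE FROM dependency WHERE project = ?", (rec.name,))
        db.executemany("INSERT INTO dependency VALUES (?, ?)",
                       [(rec.name, dep) for dep in rec.depends_on])
    db.commit()


def dependencies_of(db, name):
    """Query the aggregated metadata: direct dependencies, sorted by name."""
    rows = db.execute("SELECT depends_on FROM dependency "
                      "WHERE project = ? ORDER BY depends_on", (name,))
    return [row[0] for row in rows]


db = sqlite3.connect(":memory:")
locations = {
    "repo-a/descriptor": ("simple", "cocoon: xerces, xalan"),
    "repo-b/descriptor": ("simple", "xerces:"),
}
aggregate(db, crawl(locations, {"simple": parse_simple}))
print(dependencies_of(db, "cocoon"))
```

The point of the parser table is that supporting a new descriptor format
means adding one parser, while everything downstream only ever sees the
normalized records in the database.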
- o -
Stage 2: Build
--------------
This is what we think of today as "gump". In short, it's the service that
uses the project metadata, does the fetching, preparing and building, and
generates a bunch of data as a result.
The difference from today's gump is that this "build-only gump" outputs
data into a database, not into HTML pages or RSS scripts. The build
stage and the data use stage are separated.
In short:
metadata database --> gump --> build data database
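
As a rough sketch of what a "build-only" stage could look like, the
Python below takes project metadata, builds each project after its
dependencies and writes one row per outcome into a build data database
instead of rendering anything. The build_result table, the state names
("success", "failed", "prereq-failed"), the host column and the build_one
callable are all hypothetical names chosen for the example, and the
dependency-ordered loop is my assumption about how the stage would
schedule work.

```python
import sqlite3
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_builds(metadata, build_one, results_db, host):
    """Build every project after its dependencies and record each outcome.

    `metadata` maps a project name to the list of its dependencies (what
    stage 1 would provide). `build_one` is a hypothetical callable that
    returns True when the build of the named project succeeds.
    """
    results_db.execute("CREATE TABLE IF NOT EXISTS build_result "
                       "(project TEXT, host TEXT, state TEXT)")
    not_built = set()
    # static_order() yields dependencies before the projects that need them.
    for project in TopologicalSorter(metadata).static_order():
        deps = metadata.get(project, [])
        if any(dep in not_built for dep in deps):
            state = "prereq-failed"      # skipped: a dependency did not build
        elif build_one(project):
            state = "success"
        else:
            state = "failed"
        if state != "success":
            not_built.add(project)
        results_db.execute("INSERT INTO build_result VALUES (?, ?, ?)",
                           (project, host, state))
    results_db.commit()


metadata = {"cocoon": ["xerces", "xalan"], "xalan": ["xerces"], "xerces": []}
results_db = sqlite3.connect(":memory:")
# Pretend that only xalan fails to build.
run_builds(metadata, lambda name: name != "xalan", results_db,
           host="debian-x86")
states = dict(results_db.execute("SELECT project, state FROM build_result"))
print(states)
```

Nothing in this stage knows about HTML, RSS or email: whatever wants to
present or act on the results reads the build_result rows later.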
- o -
Stage 3: Build Data Use
-----------------------
This is what is performed today by the 'actors' inside Gump 2.x. The
current actors are:
1) document
2) repository
3) notify
4) stats
5) syndication
6) timing
7) rdf
8) mysql
9) results
We could aggregate them in the following taxonomy:
  [web]
    [html]
      document    -> creates the forrest output
      results     -> creates the XHTML output
      stats       -> does the stats part
      timing      -> does the timing part
    [others]
      syndication -> does the RSS feeds
      RDF         -> does the RDF descriptors
  [email]
    notify      -> notifies the mail lists
  [history]
    mysql       -> saves historical data
    repository  -> saves the built jar files
My suggestion is to move all of those out of stage 2, leaving only the
"historical" actors there (basically pumping all the data into the
historical database), and to let the others reside in stage 3.
So, for stage 3 I see two possible services:
1) the web service, taking care of things like:
- web pages
- historical graphs
- syndication of results
2) the notification service, taking care of sending emails to the
various projects
In short:
  metadata database   --+  +--> email notifier
                        +--+
  build data database --+  +--> webapp
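
A minimal sketch of the notification service, assuming it reads from both
databases: the build data database tells it what went wrong and the
metadata database tells it whom to tell. The table and column names
(build_result, project, notify_address), the state values and the
example.org addresses are all made up for the example. The messages are
only composed here, not sent.

```python
import sqlite3
from email.message import EmailMessage


def pending_notifications(metadata_db, build_db, sender):
    """Return one EmailMessage per unsuccessful project with a known address."""
    messages = []
    failures = build_db.execute(
        "SELECT project, state FROM build_result "
        "WHERE state != 'success' ORDER BY project").fetchall()
    for project, state in failures:
        row = metadata_db.execute(
            "SELECT notify_address FROM project WHERE name = ?",
            (project,)).fetchone()
        if row is None:
            continue                     # nobody to notify for this project
        msg = EmailMessage()
        msg["From"] = sender
        msg["To"] = row[0]
        msg["Subject"] = f"[gump] {project}: {state}"
        msg.set_content(f"Project {project} finished in state '{state}'.")
        messages.append(msg)
    return messages


# Stand-ins for the two databases produced by stages 1 and 2.
metadata_db = sqlite3.connect(":memory:")
metadata_db.execute("CREATE TABLE project (name TEXT, notify_address TEXT)")
metadata_db.execute("INSERT INTO project VALUES "
                    "('cocoon', 'dev@cocoon.example.org')")

build_db = sqlite3.connect(":memory:")
build_db.execute("CREATE TABLE build_result (project TEXT, state TEXT)")
build_db.executemany("INSERT INTO build_result VALUES (?, ?)",
                     [("xerces", "success"), ("cocoon", "failed")])

messages = pending_notifications(metadata_db, build_db, "gump@example.org")
for msg in messages:
    print(msg["To"], "|", msg["Subject"])
```

The webapp would be a second, independent reader of the same two
databases, so either service can be changed or replaced without touching
the build stage.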
- o -
Advantages
----------
This new architecture has several advantages:
1) the concerns are more easily separated, which also means that
different stages can be built using different languages. The webapp that
I'm working on, for example (codename 'dynagump' and located in
http://svn.apache.org/repos/asf/gump/dynagump/trunk), is a Cocoon
application.
2) by decoupling the architecture, it's easier to have multiple
machines running the second stage in parallel (either controlled by us
or simply donated by users), for example:
                     --- Debian on x86 ---
                    /                      \
                   /                        v
  metadata database ---- MacOSX on PPC ---> build data database
                   \                        ^
                    \                      /
                     --- WinXP on x86 ----
*and* it is also easier to install a "build stage" on a given machine,
since the metadata bootstrap phase should be done automatically. For
example, it should be sufficient to say "gump build asf:cocoon" in order
for the whole system to be prepared, packaged and ready to go.
3) also by allowing gump to adapt the existing descriptors into a
database form, it's easier to empower users by either allowing them to
maintain their data in the original form (ie. Maven descriptors) or to
adapt/modify the data in the database directly (for example, thru a web
application).
4) the contracts between the stages are databases: once these models
are codified, it's possible for the three stages to work in complete
isolation, without affecting one another.
Comments?
--
Stefano.