gump-general mailing list archives

From Stefano Mazzocchi <stef...@apache.org>
Subject [RT] Gump 3.0
Date Mon, 06 Dec 2004 23:13:28 GMT
I've been working for a while to describe an improved architecture for 
Gump and I have decided to "go public" with the discussion because I 
want this to be a community effort.

                               - o -

First and foremost, I believe that gump is one of the most exciting 
things that have happened in the software space over the last few 
years, but I also think that technical, architectural and social 
limitations are stopping it from exhibiting its real potential.

The biggest problem I have is the fact that gump is such an integrated 
system: it tries to do too much in one single stage.

Don't get me wrong: the internals of gump 2.x are rather modular and 
well architected, but the overall system architecture is too monolithic.

So, here is my first suggestion: split gump into three stages.

  1) metadata aggregation
  2) build
  3) build data use

                                - o -

Stage 1: Metadata aggregation
-----------------------------

Gump will scale socially only when the metadata about a project is 
taken care of by the people who administer that project rather than by 
a few gump meisters.

In this regard, I believe Maven to be far superior to ant in terms of 
gump-friendliness because of its completely declarative nature (ant 
builds are a functional language, from which project metadata cannot be 
transparently inferred).

In a perfect world, all projects would *need* a metadata representation 
of their structure so that a build tool can parse it and understand 
what the project needs.

In the real world, there are two camps:

  1) procedural: make, configure, sh, ant
  2) declarative: maven, apt-get, ports

and the second normally builds on the first.

The absolute need for gump (or apt-get, or BSD ports) is to have a 
"declarative" layer on top of the "procedural" one for every project, a 
'semantic' layer that the system can understand and work on.

Debian shows that it's possible to socially scale the concept of adding 
a semantic layer on top of existing project efforts, in a completely 
independent fashion.

Maven shows that it's possible for the projects themselves to make good 
use of this information (also calling ant when special needs arise).

For gump, what's important is that having maven generate gump 
descriptors is both stupid and inefficient: gump should be able to 
digest the maven POM directly, without requiring any effort from the 
project.

We should be maintaining the metadata representation only for the projects 
that don't have that data integrated in their build system (like pure 
ant projects or make/configure projects).

So, what is a metadata aggregation layer?

It's a crawler for project metadata. It crawls projects and their 
descriptors and aggregates them in a service that can be queried to 
obtain that information.

In short:

    [bunch of locations] --> crawler --> metadata database
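
To make the crawler idea concrete, here is a minimal Python sketch of 
stage 1. Everything in it is an assumption made for illustration: the 
two descriptor snippets are heavily simplified stand-ins for a Maven POM 
and a gump descriptor, the table names are hypothetical, and SQLite 
merely stands in for whatever metadata database ends up being used.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, simplified descriptors. Real Maven POMs and gump
# descriptors carry far more detail than this.
MAVEN_POM = """<project>
  <id>cocoon</id>
  <dependencies>
    <dependency><id>xalan</id></dependency>
    <dependency><id>xerces</id></dependency>
  </dependencies>
</project>"""

GUMP_DESCRIPTOR = """<module name="xalan">
  <project name="xalan"><depend project="xerces"/></project>
</module>"""


def parse_maven(text):
    """Return [(project, [dependencies])] from the simplified POM."""
    root = ET.fromstring(text)
    deps = [d.findtext("id") for d in root.iter("dependency")]
    return [(root.findtext("id"), deps)]


def parse_gump(text):
    """Return [(project, [dependencies])] from the simplified descriptor."""
    root = ET.fromstring(text)
    return [(p.get("name"), [d.get("project") for d in p.iter("depend")])
            for p in root.iter("project")]


# One adapter per descriptor format; the crawler itself stays format-neutral.
PARSERS = {"maven": parse_maven, "gump": parse_gump}


def crawl(locations, db):
    """Aggregate every (format, descriptor text) pair into the database."""
    db.execute("CREATE TABLE IF NOT EXISTS project "
               "(name TEXT PRIMARY KEY, source TEXT)")
    db.execute("CREATE TABLE IF NOT EXISTS dependency "
               "(project TEXT, depends_on TEXT)")
    for kind, text in locations:
        for name, deps in PARSERS[kind](text):
            db.execute("INSERT OR REPLACE INTO project VALUES (?, ?)",
                       (name, kind))
            # Re-crawling a project replaces its old dependency rows.
            db.execute("DELETE FROM dependency WHERE project = ?", (name,))
            db.executemany("INSERT INTO dependency VALUES (?, ?)",
                           [(name, dep) for dep in deps])
    db.commit()


db = sqlite3.connect(":memory:")
crawl([("maven", MAVEN_POM), ("gump", GUMP_DESCRIPTOR)], db)
```

The point of the adapter table is that supporting a new descriptor 
format means adding one parser, while the database shape stays the same.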

                              - o -

Stage 2: Build
--------------

This is what we think of today as "gump". In short, it's the service 
that uses the project metadata, does the fetching, preparing and 
building, and generates a bunch of data as a result.

The difference from today's gump is that this "build-only gump" outputs 
data into a database, not into HTML pages or RSS scripts. The build 
stage and the data use stage are separated.

In short:

    metadata database --> gump --> build data database
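
As a rough illustration of a "build-only gump", the following Python 
sketch reads dependencies from a metadata database, builds projects in 
dependency order and writes one row per project into a build data 
database. The table layouts, the state names and the build_one callback 
are all hypothetical, SQLite is only a stand-in, and graphlib requires 
Python 3.9 or later.

```python
import sqlite3
from graphlib import TopologicalSorter


def load_graph(metadata_db):
    """Map each project to the list of projects it depends on."""
    graph = {}
    for project, depends_on in metadata_db.execute(
            "SELECT project, depends_on FROM dependency"):
        graph.setdefault(project, []).append(depends_on)
    return graph


def run_builds(metadata_db, build_db, build_one):
    """Build everything in dependency order and record the outcome.

    build_one(name) is a stand-in for fetch + prepare + build and
    returns True on success.
    """
    build_db.execute("CREATE TABLE IF NOT EXISTS build_result "
                     "(project TEXT, state TEXT)")
    graph = load_graph(metadata_db)
    states = {}
    # static_order() yields dependencies before their dependents.
    for name in TopologicalSorter(graph).static_order():
        if any(states[dep] != "success" for dep in graph.get(name, [])):
            state = "prereq-failed"   # skip: a dependency did not build
        else:
            state = "success" if build_one(name) else "failed"
        states[name] = state
        build_db.execute("INSERT INTO build_result VALUES (?, ?)",
                         (name, state))
    build_db.commit()
    return states


# Tiny demo: xalan is made to fail, so cocoon is never attempted.
metadata_db = sqlite3.connect(":memory:")
metadata_db.execute("CREATE TABLE dependency (project TEXT, depends_on TEXT)")
metadata_db.executemany(
    "INSERT INTO dependency VALUES (?, ?)",
    [("cocoon", "xalan"), ("cocoon", "xerces"), ("xalan", "xerces")])
build_db = sqlite3.connect(":memory:")
states = run_builds(metadata_db, build_db,
                    build_one=lambda name: name != "xalan")
```

Note that nothing here renders HTML or sends mail: the only output is 
rows in the build data database.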

                               - o -

Stage 3: Build Data Use
-----------------------

This is what is performed today by the 'actors' inside Gump 2.x. The 
current actors are:

  1) document
  2) repository
  3) notify
  4) stats
  5) syndication
  6) timing
  7) rdf
  8) mysql
  9) results

We could aggregate them in the following taxonomy:

  [web]
    [html]
      document -> creates the forrest output
      results -> creates the XHTML output
      stats -> does the stats part
      timing -> does the timing part
    [others]
      syndication -> does the RSS feeds
      rdf -> does the RDF descriptors
  [email]
    notify -> notifies the mail lists
  [history]
    mysql -> saves historical data
    repository -> saves the built jar files

My suggestion is to move all of those out of stage 2, leaving only the 
"historical" actors there (basically pumping all the data into the 
historical database), and to let the others reside in stage 3.

So, for stage 3 I see two possible services:

  1) the web service, taking care of things like:
      - web pages
      - historical graphs
      - syndication of results

  2) the notification service, taking care of sending emails to the 
various projects

In short:

    metadata database   --+  +--> email notifier
                          +--+
    build data database --+  +--> webapp
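
A stage 3 service then only needs read access to the two databases. The 
Python sketch below shows a notifier in that style: it joins build 
results with a per-project contact address and composes (but does not 
send) one message per non-successful project. The table layouts, the 
notify_address column, the subject format and the example.org addresses 
are all hypothetical.

```python
import sqlite3
from email.message import EmailMessage


def compose_notifications(metadata_db, build_db):
    """Return one EmailMessage per project that did not build."""
    messages = []
    failed = build_db.execute(
        "SELECT project, state FROM build_result "
        "WHERE state != 'success' ORDER BY project").fetchall()
    for project, state in failed:
        row = metadata_db.execute(
            "SELECT notify_address FROM project WHERE name = ?",
            (project,)).fetchone()
        if row is None:
            continue  # nobody registered to be notified for this project
        msg = EmailMessage()
        msg["To"] = row[0]
        msg["Subject"] = f"[gump] {project}: {state}"
        msg.set_content(f"The latest gump run left {project} "
                        f"in state '{state}'.")
        messages.append(msg)
    return messages


# Demo data standing in for the output of stages 1 and 2.
metadata_db = sqlite3.connect(":memory:")
metadata_db.execute("CREATE TABLE project (name TEXT, notify_address TEXT)")
metadata_db.executemany("INSERT INTO project VALUES (?, ?)", [
    ("cocoon", "dev@cocoon.example.org"),
    ("xalan", "dev@xalan.example.org"),
])
build_db = sqlite3.connect(":memory:")
build_db.execute("CREATE TABLE build_result (project TEXT, state TEXT)")
build_db.executemany("INSERT INTO build_result VALUES (?, ?)", [
    ("xerces", "success"), ("xalan", "failed"), ("cocoon", "prereq-failed"),
])
messages = compose_notifications(metadata_db, build_db)
```

A webapp would consume the same two databases in the same read-only way, 
which is what lets it be written in a different language.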

                          - o -

Advantages
----------

This new architecture has several advantages:

  1) the concerns are more easily separated, which also means that 
different stages can be built using different languages. The webapp 
that I'm working on, for example (codename 'dynagump', located at 
http://svn.apache.org/repos/asf/gump/dynagump/trunk), is a Cocoon 
application.

  2) by decoupling the architecture, it's easier to have multiple 
machines running the second stage in parallel (either controlled by us 
or simply donated by the users), for example:

                         --- Debian on x86 ---
                        /                     \
                       /                       v
      metadata database ---- MacOSX on PPC ---> build data database
                       \                       ^
                        \                     /
                         --- WinXP on x86 ----

  *and* it is also easier to install a "build stage" on a given machine, 
since the metadata bootstrap phase should be done automatically. For 
example, it should be sufficient to say "gump build asf:cocoon" in order 
for the whole system to be prepared, packaged and ready to go.

  3) also, by allowing gump to adapt the existing descriptors into a 
database form, it's easier to empower users by either allowing them to 
maintain their data in the original form (i.e. Maven descriptors) or to 
adapt/modify the data in the database directly (for example, through a 
web application).

  4) the contracts between the stages are databases: once these models 
are codified, it's possible for the three stages to work in complete 
isolation, without affecting one another.
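
One way to picture "the contract is a database" is a schema that every 
builder writes to and every consumer reads from. The Python sketch below 
is only an assumption about what such a contract could look like: the 
table, the platform and run_id columns and the state names are 
hypothetical, and SQLite stands in for the real database.

```python
import sqlite3

# Hypothetical contract between stage 2 (writers) and stage 3 (readers).
BUILD_DATA_SCHEMA = """
CREATE TABLE IF NOT EXISTS build_result (
    project  TEXT    NOT NULL,
    platform TEXT    NOT NULL,   -- e.g. 'Debian on x86'
    run_id   INTEGER NOT NULL,   -- increases with every run on a platform
    state    TEXT    NOT NULL,   -- 'success', 'failed', ...
    PRIMARY KEY (project, platform, run_id)
);
"""


def record(db, project, platform, run_id, state):
    """What a stage 2 builder on any machine would call."""
    db.execute("INSERT INTO build_result VALUES (?, ?, ?, ?)",
               (project, platform, run_id, state))


def latest_by_platform(db, project):
    """What a stage 3 consumer would call: newest state per platform."""
    rows = db.execute(
        """
        SELECT platform, state FROM build_result AS r
        WHERE project = ?
          AND run_id = (SELECT MAX(run_id) FROM build_result
                        WHERE project = r.project
                          AND platform = r.platform)
        ORDER BY platform
        """, (project,))
    return dict(rows)


db = sqlite3.connect(":memory:")
db.executescript(BUILD_DATA_SCHEMA)
# Two independent builders reporting into the same database.
record(db, "cocoon", "Debian on x86", 1, "failed")
record(db, "cocoon", "Debian on x86", 2, "success")
record(db, "cocoon", "MacOSX on PPC", 1, "failed")
db.commit()
latest = latest_by_platform(db, "cocoon")
```

Because builders only insert rows and consumers only query them, either 
side can be replaced or multiplied without the other noticing.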

Comments?

-- 
Stefano.



