cocoon-dev mailing list archives

From Stefano Mazzocchi <>
Subject [finally!] The "Infamous" Cocoon Sitemap Proposal
Date Tue, 28 Dec 1999 23:50:40 GMT
Ok, here we go.

This document proposes the Cocoon sitemap.

Moreover, as of last week, Cocoon is also the subject of my college
degree thesis, titled "XML Web Publishing". Nothing changes from your
perspective and I won't be more or less intrigued by this technology,
but it will allow me to focus on this without having my girlfriend say
"your job is to finish up school first" and all that stuff :)

Anyway, by the end of my thesis (end of next year, or hopefully before),
Cocoon 2 will have to be a complete reality in all its parts. Everything
that is planned, thought of, or on the todo and wish list (at this time)
will have to be implemented.

But, of course, I plan to finish it much earlier than that :)

So, here we are: the following is written by me, but the
underlying ideas reflect months of discussion on this list and were
brought to life in the last two days, when Pier and I spent all day and
night (until 6AM!!!) sharing ideas about this.


--------------------- cut here ----------------------------

             The Infamous Cocoon Sitemap


  Cocoon 1.x grew up quickly and there was not enough knowledge to
understand how things would evolve in the XML world.

  Cocoon started with one fixed chain of DOM processors. This was
recognized early on as a design flaw.

  Then at the Exolab in May, I proposed the use of the Reactor pattern.
While this allowed us to reach the flexibility we have today, it
forces distribution of control, which creates too many contracts between
the different contexts. Result: increased management costs.

  Also, the use of the reactor pattern with an event driven model soon
proved very complex to implement. Moreover, the number of processing
pipelines used in an average site is very small compared to the number
of resources/pages.

  In the end, the pyramid model of site management forces a better
separation between the working contexts (content, style and logic) and
the management context.

  Mainly for these reasons, the sitemap was proposed as an alternative
solution that helps to centralize site management and reduce its costs,
without limiting the flexibility of the Cocoon highly-modular

What do you mean by "sitemap"?

  A sitemap is a configuration repository or a collection of them. We'll
assume there will be ways (XInclude, Infosets, external entities, XML
Inheritance) to create a single repository out of a collection with no
loss of information/structure.

  A configuration repository is a structured collection of
configurations (please refer to the interfaces on the xml-cocoon2
branch for more info on the design patterns behind this, or look
at the Avalon sources) that instructs the resource processing.
  A resource process is the action of generating a response from a given

  A resource is uniquely indicated by a URI and it's used to trigger the
processing, if this URI is "mounted" to Cocoon.

What do you mean by "mounting"?

  Just as you mount the root of one file system onto a path in
another (see the Unix "mount" command), the Servlet 2.2 model
follows the "mount" pattern, directing all URIs that start with the
mounted path to the web application that is in charge of
processing the request.

  Today, Cocoon doesn't follow the mount pattern, but uses the extension
reaction pattern. This pattern causes many problems since it cannot be
restricted to certain parts of the URI scope.

  The extension reaction pattern, at least at the web server level, will
be abandoned in favor of the URI mounting pattern.
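
  To make the difference concrete, here is a minimal sketch of a mount
check next to the extension check it would replace. The class name
MountPoint, its methods and the example paths are made up for
illustration; this is neither Cocoon code nor the Servlet API.

```java
// Hypothetical illustration of the "mount" pattern: a URI belongs to the
// application if it falls under the mounted path prefix.
class MountPoint {
    private final String path;   // e.g. "/docs", without a trailing slash

    MountPoint(String path) {
        this.path = path;
    }

    // True if the URI is the mounted path itself or lies below it.
    boolean covers(String uri) {
        return uri.equals(path) || uri.startsWith(path + "/");
    }

    // The part of the URI below the mount point ("" for the mount point itself).
    String relative(String uri) {
        if (!covers(uri)) {
            throw new IllegalArgumentException(uri + " is not under " + path);
        }
        return uri.substring(path.length());
    }

    // The extension reaction pattern, for contrast: it fires anywhere in
    // the URI space, whatever the directory.
    static boolean reactsToExtension(String uri, String extension) {
        return uri.endsWith(extension);
    }

    public static void main(String[] args) {
        MountPoint docs = new MountPoint("/docs");
        System.out.println(docs.covers("/docs/guide/index.xml"));
        System.out.println(docs.covers("/docs-old/index.xml"));
        System.out.println(reactsToExtension("/docs-old/index.xml", ".xml"));
    }
}
```

The prefix check can be confined to one subtree, while the extension
check also reacts to "/docs-old/index.xml", which is exactly the
"cannot be restricted" problem described above.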

Resource partitioning

  In theory, each URI has its own process. In real life, the mounted
URI space can be partitioned into equi-processing areas where the
requests go thru the same processing pipeline.

  The first necessity of a sitemap is the ability to partition the URI
space in a flexible, yet simple manner. Here is an example:

   <process uri="/index.html">
   <process uri="/docs/*/special/*.xml">
   <process uri="/docs/*.xml">
   <process uri="*.pdf">
   <process uri="/images/dynamic/*.jpg">

which partitions the site into "processing areas".

 There are a few issues to note:

1) the order of the URI elements _is_ functional. The sitemap is not
always a mathematical partition, but it's a mathematical "coverage" of
the URI space. This means that some URIs may match more than one
pattern. For this reason, the order of matching is important.

2) the pattern logic will be pluggable. There will be a way to implement
"pattern matchers" that implement the following interface

 public interface PatternMatcher {
   boolean matches(String str, String pattern);
 }

3) the default matching logic for URI matching will be the simple
"star-based" logic, where the '*' char stands for any chars, in any
number.
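
  The three points above can be sketched together as follows. This is
only an illustration under stated assumptions: the PatternMatcher
interface is the one given above (declared without "public" so the
sketch fits in one file), while StarMatcher, OrderedSitemap and the area
names are hypothetical, and '*' is assumed to match the empty string too.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// The interface as given in the proposal.
interface PatternMatcher {
    boolean matches(String str, String pattern);
}

// Hypothetical default matcher: '*' stands for any run of chars,
// including an empty one (an assumption of this sketch).
class StarMatcher implements PatternMatcher {
    @Override
    public boolean matches(String str, String pattern) {
        int s = 0, p = 0;          // cursors into str and pattern
        int star = -1, mark = 0;   // last '*' seen, and where in str it was tried
        while (s < str.length()) {
            if (p < pattern.length() && pattern.charAt(p) == '*') {
                star = p++;        // remember the star, first let it match nothing
                mark = s;
            } else if (p < pattern.length() && pattern.charAt(p) == str.charAt(s)) {
                p++;
                s++;
            } else if (star >= 0) {
                p = star + 1;      // backtrack: the last star swallows one more char
                s = ++mark;
            } else {
                return false;
            }
        }
        while (p < pattern.length() && pattern.charAt(p) == '*') {
            p++;                   // trailing stars match the empty string
        }
        return p == pattern.length();
    }
}

// Hypothetical holder for the <process uri="..."> entries, in document order.
class OrderedSitemap {
    private final PatternMatcher matcher;
    private final Map<String, String> areas = new LinkedHashMap<>();

    OrderedSitemap(PatternMatcher matcher) {
        this.matcher = matcher;
    }

    OrderedSitemap process(String uriPattern, String areaName) {
        areas.put(uriPattern, areaName);
        return this;
    }

    // Returns the first declared area whose pattern matches, or null.
    String resolve(String uri) {
        for (Map.Entry<String, String> e : areas.entrySet()) {
            if (matcher.matches(uri, e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }
}

class SitemapOrderDemo {
    public static void main(String[] args) {
        OrderedSitemap map = new OrderedSitemap(new StarMatcher())
            .process("/index.html", "home")
            .process("/docs/*/special/*.xml", "special")
            .process("/docs/*.xml", "docs")
            .process("*.pdf", "pdf")
            .process("/images/dynamic/*.jpg", "images");
        // This URI matches both the "special" and the "docs" pattern;
        // declaration order decides.
        System.out.println(map.resolve("/docs/cocoon/special/faq.xml"));
        System.out.println(map.resolve("/docs/cocoon/faq.xml"));
    }
}
```

Swapping the second and third entries would send every "special"
document thru the plain "docs" area, which is why the order of the
elements is functional.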

Why not use Regular Expressions?

  The matcher interface will allow you to implement your own matching
syntax and regexp might be added easily, in case this is found useful.
This said, I think that regexp are much too complicated for what we
need: true, they define a _very_ powerful way to deal with strings, but
their syntax is rather unfriendly for newbies and rather slow to

  Also, my own personal opinion is that if you need the power of regexps
to partition your URI space, there's something wrong in your URI space
design. Regexps are very useful for stuff like file searching, or
unstructured text searching. As Perl author Larry Wall says, Perl is a
design mess and adapts very well to everything that is a design mess.

  Here, since we don't have legacy stuff around, I'd rather force you to
think more about how to partition your URI space rather than giving you
the power to do it. Painful at first? yes, sir. But with a good reason.
(and if you disagree and love regexps, change the matcher and use them)
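
  For those who do want to change the matcher, a regexp plug-in could be
as small as the sketch below. It assumes the PatternMatcher interface
from this proposal and uses the java.util.regex package of the standard
Java library; the class name RegexpMatcher and the example pattern are
hypothetical.

```java
import java.util.regex.Pattern;

// The interface as given in the proposal.
interface PatternMatcher {
    boolean matches(String str, String pattern);
}

// Hypothetical plug-in: reads the sitemap pattern as a regular expression
// and requires it to match the whole URI.
class RegexpMatcher implements PatternMatcher {
    @Override
    public boolean matches(String str, String pattern) {
        return Pattern.matches(pattern, str);
    }

    public static void main(String[] args) {
        PatternMatcher m = new RegexpMatcher();
        // Year/month archive URIs, something the star syntax cannot express.
        String archive = "/docs/\\d{4}/\\d{2}/.*\\.xml";
        System.out.println(m.matches("/docs/2000/01/index.xml", archive));
        System.out.println(m.matches("/docs/cocoon/01/index.xml", archive));
    }
}
```

A real plug-in would probably cache compiled patterns rather than
recompile them on every request; that is left out to keep the sketch
short.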

Isn't this sitemap getting a little big?

  A complex web site may have millions of valid URIs, but I estimate the
number of pipelines to be a function of the logarithm of the number of
URIs present in the URI space. Something like

    p(n) = a * log(n) + b

where 'n' is the number of URIs, 'p' the number of pipelines, and 'a'
and 'b' are tuning factors.
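
  To get a feel for the formula, the sketch below evaluates it for three
site sizes. The tuning factors a = 10 and b = 5 and the choice of a
base-10 logarithm are invented for the example, not measured on any real
site, so only the shape of the growth is meaningful.

```java
// Hypothetical worked example of p(n) = a * log(n) + b.
class PipelineEstimate {
    // Uses a base-10 logarithm; a different base only rescales 'a'.
    static long pipelines(long uris, double a, double b) {
        return Math.round(a * Math.log10(uris) + b);
    }

    public static void main(String[] args) {
        double a = 10.0;   // made-up tuning factor
        double b = 5.0;    // made-up tuning factor
        for (long n : new long[] {1_000L, 1_000_000L, 1_000_000_000L}) {
            System.out.println(n + " URIs -> about " + pipelines(n, a, b) + " pipelines");
        }
    }
}
```

With these invented factors, going from a thousand to a billion URIs
takes the estimate from 35 to 95 pipelines: a millionfold increase in
content for less than three times the management effort.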

  The importance of this must not be underestimated: even for small
sites, the number of pipelines is manageable, but when the site scales,
the number of pipelines grows significantly more slowly than the
information system as a whole.

  If you don't get it, show it to your web system administrator: he'll
know what this means :)

The processing area

  Now, we have partitioned our URI space into processing areas. In a
processing area, all URIs go thru the same processing pipeline. From a
Cocoon point of view, it's as if they were all painted with the same
color, and there is a different pipeline for each different color.

  A processing pipeline is formed by (in general)

  - one producer
  - zero or more filters
  - one serializer

so the URI area definition should, at least, look like this

 <process uri="*.xml">
  <producer .../>
  <filter .../>
  <filter .../>
  <serializer .../>
  <!-- additional area-specific configurations -->
 </process>
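
  In code, such an area boils down to a simple composition. The sketch
below is an assumption-laden illustration, not the xml-cocoon2
interfaces: Producer, Filter, Serializer and Pipeline are hypothetical
names, and the document is passed around as a plain String where the
real thing would pass XML (DOM or SAX events).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stage interfaces; a String stands in for the XML document.
interface Producer   { String produce(String uri); }
interface Filter     { String filter(String document); }
interface Serializer { byte[] serialize(String document); }

// One producer, zero or more filters applied in order, one serializer.
class Pipeline {
    private final Producer producer;
    private final List<Filter> filters = new ArrayList<>();
    private final Serializer serializer;

    Pipeline(Producer producer, Serializer serializer) {
        this.producer = producer;
        this.serializer = serializer;
    }

    Pipeline add(Filter filter) {
        filters.add(filter);
        return this;
    }

    byte[] process(String uri) {
        String document = producer.produce(uri);
        for (Filter f : filters) {
            document = f.filter(document);
        }
        return serializer.serialize(document);
    }

    public static void main(String[] args) {
        Pipeline xml = new Pipeline(
                uri -> "<page>" + uri + "</page>",              // toy producer
                doc -> doc.getBytes(StandardCharsets.UTF_8))    // toy serializer
            .add(doc -> doc.toUpperCase(Locale.ROOT))           // first filter
            .add(doc -> "[" + doc + "]");                       // second filter
        System.out.println(new String(xml.process("/a.xml"), StandardCharsets.UTF_8));
    }
}
```

Every URI in the area reuses the same Pipeline object, so the number of
such objects follows the number of areas, not the number of pages.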

Module parameters

  To be able to create a DTD for the sitemap, we can't hardwire the
parameters as attributes; we need to make the DTD flexible enough to
accommodate Cocoon's modularity. So we need something like

 <producer name="file">
  <param name="mount-point" value="/home/www/xdocs"/>
 </producer>

 <filter name="xslt">
  <param name="stylesheet" value="sheets/page-html.xsl"/>
 </filter>
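
  On the Java side, generic <param> elements map naturally onto a
name/value collection. The sketch below is one possible shape, with a
hypothetical Parameters class; it also shows the inheritance mentioned
in the notes further down, under the assumption (not stated in this
proposal) that a value given inside <process> overrides the one from the
component definition.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical container for the <param name="..." value="..."/> pairs
// of one producer, filter or serializer.
class Parameters {
    private final Map<String, String> values = new LinkedHashMap<>();

    Parameters set(String name, String value) {
        values.put(name, value);
        return this;
    }

    String get(String name) {
        return values.get(name);
    }

    // Returns a new set: these values, overridden by those in 'local'.
    // That local values win is an assumption of this sketch.
    Parameters overriddenBy(Parameters local) {
        Parameters merged = new Parameters();
        merged.values.putAll(values);
        merged.values.putAll(local.values);
        return merged;
    }

    public static void main(String[] args) {
        Parameters definition = new Parameters()
            .set("mount-point", "/home/www/xdocs");
        Parameters inProcess = new Parameters()
            .set("mount-point", "/home/www/special");
        System.out.println(definition.overriddenBy(inProcess).get("mount-point"));
    }
}
```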

Ok, enough introduction, let's get to the real stuff... here is a
snapshot of the new Cocoon configurations:

    <!-- ... -->

  <sitemap syntax="...">
    <process uri="..." translate="..." errorhandler="...">
      <matcher class="...">
        <param name="..." value="..."/>
      </matcher>
      <producer name="..." >
        <param name="..." value="..."/>
      </producer>
      <filter name="...">
        <param name="..." value="..."/>
      </filter>
      <serializer name="...">
        <param name="..." value="..."/>
      </serializer>
    </process>

    <read uri="..." translate="..." errorhandler="...">
      <matcher class="...">
        <param name="..." value="..."/>
      </matcher>
    </read>

    <redirect uri="..." translate="..." errorhandler="...">
      <matcher class="...">
        <param name="..." value="..."/>
      </matcher>
    </redirect>

    <!-- do we need to generate HTTP errors? -->
  </sitemap>

    <component class="...">
      <param name="..." value="..."/>
    </component>
    <component class="...">
      <param name="..." value="..."/>
    </component>
    <component class="...">
      <param name="..." value="..."/>
    </component>
    <producer name="..." class="...">
      <param name="..." value="..."/>
    </producer>
    <producer name="..." class="...">
      <param name="..." value="..."/>
    </producer>
    <filter name="..." class="...">
      <param name="..." value="..."/>
    </filter>
    <filter name="..." class="...">
      <param name="..." value="..."/>
    </filter>
    <serializer name="..." class="...">
      <param name="..." value="..."/>
    </serializer>
    <serializer name="..." class="...">
      <param name="..." value="..."/>
    </serializer>


1) the producers/filters/serializers used inside the <process> element
automatically inherit the parameters indicated in the definitions below
to reduce verbosity.

2) components must implement interfaces of the form


and the component is registered with the role "type" in the

3) <process uri="..." translate="..."> applies very simple URL
rewriting. For example, given the URI /docs/cocoon/index.xml the element

 <process uri="*/*.xml" translate="/usr/local/www/*/new/*.xml">

matches the given URI and rewrites it to


4) the <matcher> element defines a way to express matching logic that is
not URI based. The class must implement the interface

 public interface Matcher {
    boolean match(Request request);
 }

that returns true if the given request is matched by this class. For
example, user-agent matching should be done in this way, or
session-parameter matching... you have the chance to write your own URI
partitioning logic without requiring the matcher to look for all request

5) <read> and <redirect> complete normal web site operation,
given that Cocoon now handles a complete URI subspace thru the mount
point and binary content cannot be processed by cocoon inside the normal

6) the sitemap is a single file and it's included inside the cocoon
configurations. This follows the Servlet API 2.2 patterns proposed with
the web application idea. If you need different URI subspaces with
separate sitemaps, then you are forced to have two different cocoons.
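
  As an illustration of point 4, here is what a user-agent matcher might
look like. The Matcher interface is the one given above; the Request
interface is a hypothetical one-method stand-in (this proposal does not
define it), and the class name UserAgentMatcher is invented.

```java
import java.util.Locale;

// Hypothetical stand-in for the request type used by Matcher.
interface Request {
    String getHeader(String name);
}

// The interface as given in the proposal.
interface Matcher {
    boolean match(Request request);
}

// Matches requests whose User-Agent header contains a given fragment,
// ignoring case.
class UserAgentMatcher implements Matcher {
    private final String fragment;

    UserAgentMatcher(String fragment) {
        this.fragment = fragment.toLowerCase(Locale.ROOT);
    }

    @Override
    public boolean match(Request request) {
        String agent = request.getHeader("User-Agent");
        return agent != null && agent.toLowerCase(Locale.ROOT).contains(fragment);
    }

    public static void main(String[] args) {
        // Toy request that only knows its User-Agent header.
        Request browser = name ->
            "User-Agent".equals(name) ? "Mozilla/4.0 (compatible; MSIE 5.0)" : null;
        System.out.println(new UserAgentMatcher("msie").match(browser));
        System.out.println(new UserAgentMatcher("lynx").match(browser));
    }
}
```

In the sitemap, the fragment would presumably arrive as a <param> of the
<matcher class="..."> element rather than as a constructor argument.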

All right, this is, for us, a complete proposal. Everything that has
been proposed for Cocoon can be done in this sitemap. I'm very proud of
this and I think that Pier and I did a good job, but we very much
welcome help to clean it up further and test its strength thru public
review.

So, let's see how solid this is :)

Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<>                             Friedrich Nietzsche
 Come to the first official Apache Software Foundation Conference!  
------------------------- http://ApacheCon.Com ---------------------
