cocoon-dev mailing list archives

From: Paul Russell
Subject: Re: Cocoon Offline mode and sitemap changes
Date: Tue, 23 May 2000 23:18:34 GMT

Hi All,

Firstly, apologies for the delay, I *started* to write this
e-mail this morning, and got buried by other things (bank
managers, mainly). Secondly, I have a feeling this e-mail
is going to end up rambling somewhat, and for this I apolo-
gise in advance.

Thirdly, this is taking things from close to the top. I'm
explaining a fair bit of the basics of Cocoon2 here to make
sure that as many people as possible can think about this
(hey, I'm lazy, if everyone else is using their braincells,
I can give mine a rest ;)

Since Stefano took the opportunity to fill you in on what
I did first time around, I'll not go into too much detail
on that front (he's pretty much got that licked).

On Tue, May 23, 2000 at 02:58:48PM +0200, Stefano Mazzocchi wrote:
> Paul wrote all three of them and they do their job very well. The problem
> is they are totally XML-unaware. And this is, IMO, a big design fault.

Yep, totally agree. I'm still (even now) getting to grips
with all the semantics of some of the XML architecture,
particularly XLink etc. The current offline module was
written in what felt like the only way to do it given
the current Cocoon2 architecture. Because I was new to
Cocoon at the time, I didn't really feel confident enough
to start suggesting changes to the sitemap ;)

> 1) the offline generator. The class that implements this is
>  org.apache.cocoon.offline.CocoonOffline
> I don't have problems in keeping this as it is, but suggestions are
> welcome.

I'm not keen on the way CocoonOffline works, currently. At
present, it extends Cocoon, and I think this is semantically
dubious. I'd much rather it *used* Cocoon. Again, the reason
it was done like this initially was (a) to get it out the door,
and (b) to avoid having to change too much of Pier's code. Now
Cocoon2 is a bit more open, I think we can move it over to what
IMO makes more sense. Do you guys agree that content providers
(servlets, offline, [something else?!]) should *use*, rather
than extend Cocoon?
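
To make the "use, rather than extend" idea concrete, here is a
minimal sketch. The Engine interface and OfflineGenerator class
are hypothetical names made up for illustration; they are not the
actual Cocoon2 API, and the real process() call takes different
arguments.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical stand-in for the Cocoon engine (NOT the real Cocoon2 class).
interface Engine {
    void process(String uri, OutputStream out) throws IOException;
}

// The offline generator holds an Engine and delegates to it,
// instead of subclassing it.
final class OfflineGenerator {
    private final Engine engine;

    OfflineGenerator(Engine engine) {
        this.engine = engine;
    }

    // Renders one URI into memory by delegating to the engine.
    byte[] render(String uri) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        engine.process(uri, buffer);
        return buffer.toByteArray();
    }
}
```

A servlet front end could hold the same Engine in the same way,
so neither content provider needs to know about the other.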

> 2) the crawler (Paul called it sitewalker, but I like crawler much more) 
> Paul identified the need for multiple crawlers to generate a site. Is
> this flexibility syndrome?

This was me saying "I don't like crawling the site; there must
be a better way, but I can't think what it is just yet, so I'm
going to abstract that away as much as possible." I *think* I
ended up polluting the abstraction slightly, looking back on it.

> Should each target have one crawler? Should we have more than
> one entry point?

Not sure what Stefano means exactly here, so if I've misunder-
stood, ignore me ;).

The primary reason for having multiple 'startpoints' as I called
them was because sometimes a crawler won't find a certain page,
either because the link is absolute (which my crawler simply
ignored on the basis that it wasn't safe to try and handle that)
or because there simply isn't a way of getting to it from the

When I refer to a 'target', I mean 'somewhere to put the result'.
The initial implementation focused on output to a particular
directory, specified by the 'target' attribute of the 'offline'
tag. One possibility I looked at was to abstract the target
so that the module could pump code directly onto a webserver,
or store it in a database (mummy! scary!) or HTTP PUT it, or
whatever other interesting ideas you guys come up with. For me,
this is a double edged sword. Most of me says KISS (Keep It
Simple, Stupid), and keep it to filesystems, the other side
of me (the OO design side) goes with the Lock Stock and Two
Smoking Barrels principle:

   "If it moves, abstract it; if it doesn't move, abstract
    it anyway... Understand? Good, cos if you don't, I'm
    gonna abstract ya."

What do you guys think? Are the potential risks of letting
people target whatever they want (and risking codebase bloat)
worth it? I'm inclined to say abstract it, but don't include
any targets other than FileSystemTarget in the base system,
unless we're really really sure it's A Good Thing.
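
Roughly what I have in mind, as a sketch only. The Target
interface and its store() method are assumptions for the sake of
illustration; only the FileSystemTarget name comes from the text
above, and this is not the existing offline code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical abstraction: "somewhere to put the result".
interface Target {
    void store(String uri, byte[] content) throws IOException;
}

// The only implementation the base system would ship in this sketch.
final class FileSystemTarget implements Target {
    private final Path root;

    FileSystemTarget(Path root) {
        this.root = root.toAbsolutePath().normalize();
    }

    @Override
    public void store(String uri, byte[] content) throws IOException {
        // Strip leading slashes so the URI resolves relative to the root.
        String relative = uri.replaceFirst("^/+", "");
        if (relative.isEmpty()) {
            throw new IOException("URI does not name a file: " + uri);
        }
        Path file = root.resolve(relative).normalize();
        // Refuse paths such as "../x" that would escape the target directory.
        if (!file.startsWith(root)) {
            throw new IOException("URI escapes target directory: " + uri);
        }
        Files.createDirectories(file.getParent());
        Files.write(file, content);
    }
}
```

Anything more exotic (HTTP PUT, databases) would then be another
Target implementation living outside the base system.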

> 3) the link parser.
> This is the most important design decision and I believe that while
> clever, Paul's idea of using MIME-driven link parsing may become very
> dangerous. Suppose we generate FO + SVG: do we have to parse it back to
> have the links? Do we have to create a link parser for every meaningful
> MIME-type our formatters support?

Again, I agree. (Stefano, could you kindly stop being right
all the damned time? ;)

> I still believe XLink is the solution.

*Again*, I agree. (see above)

What I can't quite get my head around is how to actually get XLink
into the equation. Linking one XML document to another is quite
another thing to preserving those links through the XSLT translations
that we're putting them through before they get to the client
(which in the case of the offline code, happens to be a file)
and working out what the request we need to give to the Cocoon
object to generate the required result is.


Okay, at this point, I'm going to leave Stefano's e-mail, and
basically explain the issues as I see them for offline
generation. There is a distinct possibility I'll drift off
into a few other things I've been thinking about recently,
but consider them to be Random Thoughts (© Stefano ;).

Both Cocoon1.x and Cocoon2 work on what I call the "request,
response" principle. This works absolutely wonderfully for
servlets and most other internet/web based scenarios, however
it isn't ideal for offline work.

Let's turn the thing on its head.

How do I make an offline site? Well, I take a load of XML
sources (notice they aren't necessarily static), I transform
them and manipulate them in various ways, and then I serialize
them into their final binary file format. The sitemap as it
stands takes us about half way there. Given a target URI and
a source URI, it can tell me what to do to get from one to 
the other.

At present, the sitemap works like this:

    <process uri="/**" src="**.xml">
      <generator name="file"/>
      <filter name="xslt"/>
      <serializer name="html"/>
    </process>

So, what does this actually mean? To follow this, it might help
to understand how Cocoon2 requests work...

Request Object  \
Response Object  |--> Cocoon.process()
Output Stream   /

Cocoon then works out from the sitemap (well, technically this
is all handled within the Sitemap and SitemapPartition classes,
but that's fairly academic at this stage) what the src URI is,
and what processes to put the XML found in that source URI through.

So, for example, using the above sitemap, say I requested
'/index'. Cocoon2 would work out that the XML source came from
a generator called 'file', and the URI to give that generator
is 'index.xml' (note the matching sets of asterisks).
It would then parse the XML file, and pass the resulting SAX
stream through an XSLT translator and into an HTML serializer.
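
The asterisk matching step on its own can be sketched like this.
WildcardRule is a hypothetical helper that handles a single "**"
only; the real matching lives in the Sitemap classes and may well
behave differently.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: maps a request URI to a source URI using one "**".
final class WildcardRule {
    private final Pattern uriPattern;
    private final String srcTemplate;

    WildcardRule(String uri, String src) {
        int star = uri.indexOf("**");
        if (star < 0 || !src.contains("**")) {
            throw new IllegalArgumentException("both patterns need a \"**\"");
        }
        // Quote the literal parts and capture whatever "**" matches.
        String prefix = uri.substring(0, star);
        String suffix = uri.substring(star + 2);
        this.uriPattern = Pattern.compile(
                Pattern.quote(prefix) + "(.*)" + Pattern.quote(suffix));
        this.srcTemplate = src;
    }

    // Returns the source URI for a request, or empty if the rule does not match.
    Optional<String> sourceFor(String requestUri) {
        Matcher m = uriPattern.matcher(requestUri);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(srcTemplate.replace("**", m.group(1)));
    }
}
```

With the rule uri="/**" src="**.xml", a request for '/index'
yields 'index.xml', as described above.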

For servlets, and other 'live' requests, this works great.
When a user asks for something, we generate it. If we can't,
we keel over.

The problem comes when we attempt to do things the other way
around. When we're generating a site offline, we have to
work out all the possible combinations of requests users could
throw at Cocoon. In the above case, where the XML is coming
from a file, it's trivial - we just translate backwards from
the files we can see on disk. If, however, the XML content
comes from somewhere else, or we're using matching code that
enables 'many to one' mappings, the whole thing falls apart.
We can't 'guess' what 'source' URIs the generator supports,
and we can't translate the 'one' to the 'many' without
generating every permutation. I don't know about you lot, but
I don't have a quantum computer, so I don't fancy that last
option ;)
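
For the trivial file-backed case, the backwards translation is
only a few lines. This sketch hard-codes the example rule
uri="/**" src="**.xml"; ReverseMapping is a hypothetical name,
and it is exactly this step that has no equivalent once the
generator isn't a directory of files.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: inverts the example rule uri="/**" src="**.xml".
final class ReverseMapping {
    // Turns source file names seen on disk into the request URIs
    // that would produce them; non-XML files are skipped.
    static List<String> requestUrisFor(List<String> sourceFiles) {
        List<String> uris = new ArrayList<>();
        for (String file : sourceFiles) {
            if (file.endsWith(".xml")) {
                String stem = file.substring(0, file.length() - ".xml".length());
                uris.add("/" + stem);
            }
        }
        return uris;
    }
}
```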

The only answer I've come up with so far, is to 'Spider' or
'crawl' the site, in a similar way to my initial implemen-
tation. If anyone can think of a better one, I'd love it,
HintHint (any Wiki fans out there? ;).
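
The crawl itself boils down to something like the sketch below.
The Crawler class and the linksOf callback are hypothetical, not
the existing sitewalker code. I'm assuming links have already
been resolved to site-relative URIs, and that "absolute" means
anything carrying a scheme.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical breadth-first crawler over several startpoints.
final class Crawler {
    // linksOf returns the links found on the page at the given URI.
    static Set<String> crawl(Collection<String> startpoints,
                             Function<String, List<String>> linksOf) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>(startpoints);
        while (!queue.isEmpty()) {
            String uri = queue.removeFirst();
            if (!visited.add(uri)) {
                continue; // already processed this page
            }
            for (String link : linksOf.apply(uri)) {
                // Assumption: a scheme marks an absolute link, which is ignored.
                if (link.contains("://")) {
                    continue;
                }
                queue.addLast(link);
            }
        }
        return visited;
    }
}
```

Multiple startpoints are what let the crawl reach pages that
nothing else links to.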

Now, the way this worked in my implementation was to spider
the *result* (post serialization) of the request, depending
on mime type. This worked well(ish) for HTML, but it isn't
going to work nearly well enough long term. How can I excuse
myself? I was young and naive, and it seemed like a sensible
solution at the time ;)

As Stefano has said, XLink is the answer. This would enable
the offline processor (name please!!) to spider over the
source XML relatively easily. The problem with this comes
with pluggable matchers - what if one source XML file/
generator produces a number of target documents? This might
not seem like that likely an occurrence, but Cocoon2 is
designed to be a pretty damned serious piece of kit. I
fully intend to be generating title images, SVG
visualisations, and god knows what else. Some of this is
likely to come from inline data (particularly the title
images), and so we have to consider this.
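
For the simple case, pulling XLinks out of the source XML is
straightforward with SAX. This is a sketch under assumptions:
XLinkCollector is a hypothetical class, it looks only at
xlink:href attributes in the http://www.w3.org/1999/xlink
namespace, and it does nothing about the one-source,
many-targets problem described above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical SAX handler that records every xlink:href it sees.
final class XLinkCollector extends DefaultHandler {
    static final String XLINK_NS = "http://www.w3.org/1999/xlink";
    private final List<String> hrefs = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attrs) {
        // Look the attribute up by namespace, not by prefix.
        String href = attrs.getValue(XLINK_NS, "href");
        if (href != null) {
            hrefs.add(href);
        }
    }

    // Parses the given XML text and returns the links in document order.
    static List<String> collect(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // required for namespace lookups
        XLinkCollector collector = new XLinkCollector();
        factory.newSAXParser()
               .parse(new InputSource(new StringReader(xml)), collector);
        return collector.hrefs;
    }
}
```

Because the handler only sees SAX events, it could in principle
sit anywhere in the pipeline before serialization.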

It's at this point that I get a bit stuck. I can't see a
way around this problem. I could really do with you guys
having a good hard think about it, to see if you visualise
a way around it. I might just be being stupid or missing
something simple (heck, I *hope* that's the case :) but
I could do with a bit of external input on it, frankly.

Okay, it's now gone midnight over here, and I think it's
time I got some sleep. I hope the above has given everyone
something to think about, and hasn't confused people even
more.

All thoughts *very* gratefully received <g>

Paul Russell
Technical Director,
Luminas Ltd.
