cocoon-dev mailing list archives

From: Jeff Turner
Subject: Re: cli.xconf questions
Date: Mon, 04 Aug 2003 12:38:55 GMT
On Mon, Aug 04, 2003 at 08:25:01AM +0000, Upayavira wrote:
> On Sat, 2 Aug 2003 22:08:21 +1000, "Jeff Turner" <> said:
> > Hi,
> > 
> > I'm tinkering around with the CLI, thinking how to add
> > don't-crawl-this-page support, and have some questions on how cli.xconf
> > currently works.  The following block in cli.xconf has me confused..
> Jeff. Great to see you're engaging with it!

It doubled Forrest's speed - I love it ;)

> I have also been working on the CLI. I've spent my week's spare time
> completely reworking it. I'll post separately about what I've been up to,
> but basically the whole thing should be much easier to understand, with a
> separate crawler class, a separate class for handling Cocoon
> initialisation, and another for handling URI arithmetic (which you're
> talking about below). As to adding exclusions, I think it should merely
> be a question of identifying the syntax. The rest, with my new code,
> should be pretty easy (e.g. tell the crawler what to ignore with a set of
> wildcard parameters).

Sounds marvellous.

> I haven't been able to debug this, as my copy of Eclipse insists on
> entering Java's Classloader code when I try to debug it. When I've worked
> out how to stop Eclipse doing that, I'll get it debugged, and put it into
> the scratchpad. 

IDEA also steps into JDK code, but can't you just 'step over' the code
instead of diving into it?  F6 I think.

> When I've got this going, I'm going to convert the xconf code to use a
> Configuration object, and then write an Ant task to do the same
> ProcessXConf, so that you can have the xconf code directly in your Ant
> script. This Ant task will be a simple wrapper around the bean, and
> should be pretty trivial.

Mmm.. nice.  Might be some ideas to steal from Ant here, notably the idea
of PatternSets and Mappers.

> I have also, I think, just sorted my problem with my caching code not
> working. Basically, the Cocoon cache is transient. Therefore it is
> lost every time Cocoon starts. And Cocoon is started every time the CLI
> starts. So if we want to have the CLI only generate new pages based upon
> the cache, we've got to make the cache for the CLI persistent. Again, see
> separate thread.

This would be really awesome :)  Lots of people have asked if Forrest
could only regenerate pages that have changed.  I'll defer further
thoughts till the other thread.

> > Come to think of it, the attribute name 'src'
> > doesn't really make sense.  What is the "source" of a Cocoon URI?  It
> > would be the XML (documents/index.xml), which is not what we're
> > specifying in @src.
> It is the source for a source/destination pair. You could see it as a
> cocoon: protocol source (almost). Would you suggest something different?

No, makes sense given that explanation.

[snip enlightening description of cli.xconf syntax - thanks!]

> > I have the feeling that cli.xconf's job, mapping URIs to the filesystem,
> > could potentially be quite intricate.  It is roughly an inverse of what
> > the sitemap does.  Perhaps we need an analogous syntax?
> Perhaps. I think we've only just started trying to work out what is
> possible here. I'd be pleased to carry on the conversation, as what we
> have at the moment is purely what I thought best, and not the result of
> much community discussion.
> There's a lot we could discuss here. For example, how do we handle the
> situation where we want to crawl a number of pages, but don't want to
> have to repeat the destination for each of them? I think we could come up
> with an elegant configuration for this. My <uri> thing is only the
> beginning. 

There is ${variable} interpolation code in Avalon, if that helps, e.g.
${context-root} in logkit.xconf.
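
To make the idea concrete, here is a minimal sketch of ${variable}
expansion, so that a destination could be declared once and reused. It is
plain java.util.regex, not Avalon's actual code; the class name
Interpolate, the method expand() and the variable name "dest" are all
made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Avalon or Cocoon.
final class Interpolate {
    // Matches ${name}, capturing the name between the braces.
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replace each ${name} with its value; unknown names are left untouched.
    static String expand(String text, Map<String, String> vars) {
        Matcher m = VAR.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = vars.get(m.group(1));
            String replacement = (value != null) ? value : m.group(0);
            // quoteReplacement stops '$' and '\' in values being misread.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> vars = new HashMap<>();
        vars.put("dest", "build/site"); // assumed variable name
        System.out.println(expand("${dest}/index.html", vars));
    }
}
```

Leaving unknown variables untouched rather than failing is just one
possible choice; a real config reader might prefer to report them.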

> The first thing to do is to start identifying the possible use cases for
> URI mappings, so that we can see the range of the problem we're trying to
> solve (and take it beyond the scope of just fixing my problems!).

Well, two observations:

1) Hosting a live Cocoon site is a PITA:

 - One has to fight with sysadmins to install JVMs.  Many site hosts
   (like SF) don't even offer Java-based services.
 - JVMs permanently chew up vast amounts of memory
 - Servlet containers hang, crash, throw OutOfMemoryErrors and are
   generally unreliable.
 - Cocoon is not particularly fast

2) A surprising number of sites **don't need to be dynamic**

So in walks our hero, the CLI.  We can get most of the magic of Cocoon,
with none of the pain.  Develop a site with a live Cocoon, and when
you're ready to deploy, serialize it to disk and serve through Apache.

That's why I think the CLI is very important.  More than *anything* else,
it has the potential to vastly widen Cocoon's audience.

So from this perspective, the need is simple.  We need the CLI to provide
as accurate a representation of the live site as possible.  Generally
this means simply mirroring the URI structure to disk.
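
As a rough sketch of what "mirroring" could mean, the snippet below maps a
URI to a path relative to the destination directory. It is not the CLI's
existing code: the class UriMirror, the method toRelativePath() and the
rule that directory-style URIs become index.html are assumptions made for
the example.

```java
// Hypothetical helper, not part of the Cocoon CLI.
final class UriMirror {
    // Map a site URI to a path relative to the destination directory.
    static String toRelativePath(String uri) {
        String path = uri;
        // Drop any query string; it has no place in a file name here.
        int q = path.indexOf('?');
        if (q >= 0) {
            path = path.substring(0, q);
        }
        // Strip leading slashes so the result stays relative.
        while (path.startsWith("/")) {
            path = path.substring(1);
        }
        // Assumption: a directory-style URI is served by an index.html file.
        if (path.isEmpty() || path.endsWith("/")) {
            path = path + "index.html";
        }
        return path;
    }

    public static void main(String[] args) {
        System.out.println(toRelativePath("/"));
        System.out.println(toRelativePath("/docs/faq.html"));
    }
}
```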

Currently, the biggest unmet need is the ability to exclude certain URLs.
There is usually non-Cocoon-generated content like Javadocs, or other
parts of the site, which needs to be excluded.
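
One way the exclusion check might look, assuming wildcard patterns in the
Ant style ('*' within one path segment, '**' across segments). Nothing
here exists in the CLI today; the class UriExcluder and the example
patterns are purely illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Cocoon CLI.
final class UriExcluder {
    private final List<Pattern> patterns = new ArrayList<>();

    UriExcluder(String... wildcards) {
        for (String w : wildcards) {
            patterns.add(Pattern.compile(toRegex(w)));
        }
    }

    // Translate a wildcard into a regex: "**" matches anything,
    // "*" matches anything except '/', all else is literal.
    static String toRegex(String wildcard) {
        StringBuilder re = new StringBuilder();
        for (int i = 0; i < wildcard.length(); i++) {
            char c = wildcard.charAt(i);
            if (c == '*') {
                boolean doubled = i + 1 < wildcard.length()
                        && wildcard.charAt(i + 1) == '*';
                if (doubled) {
                    re.append(".*");
                    i++; // consume the second '*'
                } else {
                    re.append("[^/]*");
                }
            } else {
                re.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return re.toString();
    }

    // True if the whole URI matches any exclusion pattern.
    boolean isExcluded(String uri) {
        for (Pattern p : patterns) {
            if (p.matcher(uri).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UriExcluder ex = new UriExcluder("apidocs/**", "*.pdf");
        System.out.println(ex.isExcluded("apidocs/org/Foo.html"));
        System.out.println(ex.isExcluded("index.html"));
    }
}
```

The crawler would then skip any link for which isExcluded() returns true,
leaving e.g. pre-built Javadocs alone.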


> I have said previously that the Bean interface should be declared
> alpha/unstable. By the sounds of it we also need to declare the xconf
> structure to be unstable too. See separate thread!
> Regards, Upayavira
