cocoon-dev mailing list archives

From "Upayavira" ...@upaya.co.uk>
Subject CLI ideas (long)
Date Mon, 31 Mar 2003 19:11:51 GMT
Dear All,

Below is a summary of a brief exchange with Nicola Ken 
regarding CLI ideas I'd like to implement. He has encouraged 
me to 'go public', which I am now doing.  

My aim in what follows is twofold: to make the CLI useful to a 
project I am working on, and to make it something that people 
would prefer to use over something like wget. [Confession: I'm 
afraid I still use wget myself.]

--ModifiableSources--

The Bean's Destination objects will be replaced with
ModifiableSources.

A destination can be configured with a uri which identifies a
ModifiableSource via its protocol. So instead of specifying a
Destination, you specify a uri, and a sourceResolver identifies the
ModifiableSource.

Then all that remains is how to work out the actual uri of a
particular output file, based upon the destination uri and the
source uri. 

This can be done by specifying the source uri as a base
and a path. The base is used to request a page, but is not appended
to the output uri. When the source path should be inserted into the
destination URI, the insertion point can be marked with a *, e.g.

ftp://ftp.host.com/htdocs/*
zip://path/*@foo.zip

If no * is present, and the destination URI ends with a /, the
source path is appended to the output uri. If it does not end with a
/, the output URI only is used, and the source path is discarded.
 
The source uri is identified by combining the base & the path
(separated by a file separator).
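As a rough illustration of the rules above, here is a minimal sketch in Python. It is not Cocoon code: the function names source_uri and destination_uri are hypothetical, and it assumes "/" as the separator and that only the first * in a destination URI marks the insertion point.

```python
def source_uri(base: str, path: str) -> str:
    """Combine the source base and path with a single '/' separator."""
    return base.rstrip("/") + "/" + path.lstrip("/")


def destination_uri(dest: str, path: str) -> str:
    """Work out the output URI for one page.

    Rules, in order:
      1. a '*' in the destination marks where the source path goes;
      2. otherwise a trailing '/' means the source path is appended;
      3. otherwise the destination is used as-is and the path is discarded.
    """
    path = path.lstrip("/")
    if "*" in dest:
        # Assumption: only the first '*' is treated as the insertion point.
        return dest.replace("*", path, 1)
    if dest.endswith("/"):
        return dest + path
    return dest
```

For example, destination_uri("zip://path/*@foo.zip", "page.html") gives zip://path/page.html@foo.zip, while a destination with no * and no trailing / is returned unchanged.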

Exactly how these will be supplied within the xconf, I've still to
work out, but each page will need a source base, source path and a
destination URI. 
 
The Bean gets a ComponentManager from its Cocoon object, and uses
this to get a SourceResolver for its own use.

Basically, this allows maximum configuration of resulting URIs. At present, for a 
URI of /site/page.html, this will be put into $dest/site/page.html. But what if you 
want that page to be at the root of your site, i.e. you don't want 'site' at the 
beginning? Well, in this case, you'd specify '/site' as the base and 'page.html' as 
the path. If page.html contained a link to something/anotherpage.html, this would 
have /site as its base and something/anotherpage.html as its path.

So any pages that are linked to will inherit the destination URI and source base 
from the linking page.
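The inheritance described above could be sketched like this, again in Python purely for illustration. The Target record and linked_target function are hypothetical names, and the sketch only handles relative links that stay inside the source base.

```python
import posixpath
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    base: str  # source base: used to request the page, never written out
    path: str  # source path, relative to the base
    dest: str  # destination URI, possibly containing a '*' marker


def linked_target(page: Target, href: str) -> Target:
    """Build the target for a relative link found in `page`.

    The new target keeps the linking page's base and destination URI;
    only the path changes, resolved against the linking page's directory.
    """
    page_dir = posixpath.dirname(page.path)
    path = posixpath.normpath(posixpath.join(page_dir, href))
    return Target(base=page.base, path=path, dest=page.dest)
```

With a page of base /site and path page.html, a link to something/anotherpage.html yields a target with base /site and path something/anotherpage.html, matching the example above.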
 
---FTPSource------
I need a writable FTPSource for a project I'm working on. Nicola Ken suggested 
looking into a VFSSource, which I will do. It shouldn't be hard to produce a 
ModifiableSource for it. I presume, to make it work, you'd configure multiple 
protocols in cocoon.xconf to point to the same code, e.g. ftp, zip, smb, etc.

[Caching thoughts removed - return to that some time later]

---Configuring a ModifiableSource---
In all of the Sources I've looked at, I haven't seen a way to pass configuration 
parameters into them. For example, how might one tell an FTPSource to use 
passive as opposed to active FTP? Any ideas?

---Source Caching------
> Cocoon probably has a lot of code for caching sources. There are two sides to caching:
improving processing time by reducing workload, and reducing writing time by not updating
pages that have not changed.

The former is already managed by Cocoon. The latter would require the pipeline to 
report if the serializer output was read from the cache. If so, content isn't written. 

[Nicola Ken - I didn't understand this bit of your reply:]
> Actually even the former is managed by Cocoon, I don't remember where but
> IIRC the Environment has such an info, only that in the current
> implementation of the CLI environments it's unimplemented.

As Nicola Ken pointed out, the links of every page would need to be cached, because 
when a page is found to be already on disk and up to date, you still need its 
links for crawling. Hmm. 
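One way to picture the link-caching problem, as a sketch only: assume a hypothetical generate(uri) callable that returns (content, links) for a freshly generated page and None when the pipeline reports the page as unchanged. Nothing here is an existing Cocoon API; LinkCache and process are made-up names.

```python
class LinkCache:
    """Remembers the links found in each page the last time it was written."""

    def __init__(self):
        self._links = {}  # source uri -> list of links

    def record(self, uri, links):
        self._links[uri] = list(links)

    def links_for(self, uri):
        return list(self._links.get(uri, []))


def process(uri, generate, write, cache):
    """Write a page only if it changed; always return its links for crawling."""
    result = generate(uri)
    if result is None:
        # Page is up to date: skip the write, but the crawler still
        # needs the links, so fall back to the cached ones.
        return cache.links_for(uri)
    content, links = result
    write(uri, content)
    cache.record(uri, links)
    return list(links)
```

The point of the sketch is that skipping the write is cheap, but the crawl can only continue if the links were stored somewhere on the previous run.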

---Threading---
Threading needs reworking as the ThreadedDestination would become
deprecated. 

The bean either needs to have threaded processing built in, or I need to create 
something like a 'threadable source', using something like threaded:ftp://blah. 
Don't know which yet.

There are two possible forms of threading: generation and dispatch
threading. 
 
In generation threading, multiple pages are generated
simultaneously. The benefit of this is that pages are likely to
appear at the destination at around the same time. The downside is
that the processor needs to switch between multiple threads
(assuming a single-processor machine).
 
Dispatch threading involves sequential page generation and then
handing the generated content to a thread pool, which dispatches
the content to its destination. The benefit of this is greater speed
of delivery when delivery takes place over a slow network connection.

This kind of threading is important for the system I want to use the CLI for. 
The pages bear no relation to each other, and speed of delivery is 
important.  (I don't plan to implement generation threading ATM).
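Dispatch threading as described above might look roughly like the following sketch. It is in Python for brevity rather than as a proposal for the Bean, and publish, generate and dispatch are hypothetical names: generate(path) produces a page's content, dispatch(path, content) delivers it.

```python
from concurrent.futures import ThreadPoolExecutor


def publish(paths, generate, dispatch, workers=4):
    """Generate pages one at a time; deliver them on a thread pool."""
    futures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for path in paths:
            content = generate(path)  # sequential generation
            # Hand delivery to the pool so a slow connection does not
            # hold up generation of the next page.
            futures.append(pool.submit(dispatch, path, content))
        # Leaving the 'with' block waits for all deliveries to finish.
    # Asking each future for its result re-raises any delivery error.
    return [f.result() for f in futures]
```

Generation stays single-threaded, so the pipeline itself never has to be thread-safe; only the delivery code does.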

Threading would be configured either once per xconf file or as part of a threaded
source URL. The default would be no threading.

Final comments from Nicola Ken:
> What about a publish-subscribe model, with complete decoupling from
> the publishing and the handling?

Can you explain more what you mean by this?

> As points that are important, I would say in order:
> 
>   1) make Cocoon *not* output the pages that have an error
>   2) make cocoon output xxxpagename.error.txt with the errors
>      of the 'xxxpagename' page (configurable)
>   3) make the report on broken links in XML so that it can be
>      added to the site (where to put it configurable)
>   4) make the content not regenerated if uptodate (very important
>      from a user perspective POV)
>   5) use ModifyableSource instead of Destination
>   6) others
> 
> Feel free to do whatever in whatever order you prefer, this is just
> what IMVHO is the priority. 1+2 are needed BTW so that crawlers see
> broken links correctly, otherwise the site seems ok but instead the
> broken links are there.

Do you have ideas as to how to do these (i.e. 1-4)? 5 is of greatest importance to 
me, but if I can understand what is involved in the others, then I can always have a 
go.

Regards, Upayavira

