From: "Upayavira"
To: cocoon-dev@xml.apache.org
Date: Mon, 31 Mar 2003 20:11:51 +0100
Subject: CLI ideas (long)
Message-ID: <3E88A107.10784.6DF4D03@localhost>

Dear All,

Below is a summary of a brief exchange with Nicola Ken regarding CLI ideas I'd like to implement. He has encouraged me to 'go public', which I am now doing.

My aim is twofold: to make the CLI useful to a project I am working on, and to make it something that people would prefer to use over something like wget. [Confession: I'm afraid I still use wget myself.]

---ModifiableSources---

The Bean's Destination objects will be replaced with ModifiableSources.
A destination is configured with a URI whose protocol identifies a ModifiableSource. So instead of specifying a Destination, you specify a URI, and a SourceResolver identifies the ModifiableSource.

All that then remains is to work out the actual URI of a particular output file from the destination URI and the source URI. This can be done by specifying the source URI as a base and a path. The base is used to request a page, but is not appended to the destination URI.

Where the source path should be inserted into the destination URI, the insertion point can be marked with a *, e.g.

  ftp://ftp.host.com/htdocs/*
  zip://path/*@foo.zip

If no * is present and the destination URI ends with a /, the source path is appended to the destination URI. If it does not end with a /, the destination URI alone is used and the source path is discarded.

The source URI is formed by combining the base and the path (separated by a file separator). Exactly how these will be supplied within the xconf I have still to work out, but each page will need a source base, a source path and a destination URI.

The Bean gets a ComponentManager from its Cocoon object, and uses this to get a SourceResolver for its own use.

Basically, this allows maximum configuration of the resulting URIs. At present, a page with the URI /site/page.html is written to $dest/site/page.html. But what if you want that page to be at the root of your site, i.e. without 'site' at the beginning? In that case, you'd specify '/site' as the base and 'page.html' as the path. If page.html contained a link to something/anotherpage.html, the linked page would have /site as its base and something/anotherpage.html as its path. So any page that is linked to inherits the destination URI and source base from the linking page.

---FTPSource---

I need a writable FTPSource for a project I'm working on. Nicola Ken suggested looking into a VFSSource, which I will do. It shouldn't be hard to produce a ModifiableSource for it.
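
As a rough illustration of the destination URI rules in the ModifiableSources section above (the * marker, the trailing /, and base plus path), here is a minimal sketch. It is written in Python purely for brevity and is not Cocoon code; the function names resolve_destination and source_uri are hypothetical, and using "/" as the separator between base and path is my assumption.

```python
def resolve_destination(dest_uri: str, source_path: str) -> str:
    """Work out the output URI for one page (hypothetical helper)."""
    if "*" in dest_uri:
        # A '*' marks the point where the source path is inserted.
        return dest_uri.replace("*", source_path, 1)
    if dest_uri.endswith("/"):
        # No '*', trailing '/': append the source path.
        return dest_uri + source_path
    # No '*', no trailing '/': use the destination URI alone
    # and discard the source path.
    return dest_uri


def source_uri(base: str, path: str) -> str:
    """Combine base and path into the URI used to request the page.

    Assumption: '/' is the separator between base and path.
    """
    return base.rstrip("/") + "/" + path.lstrip("/")


# The page is requested as /site/page.html but written without 'site'.
requested = source_uri("/site", "page.html")
written = resolve_destination("ftp://ftp.host.com/htdocs/*", "page.html")

# A linked page inherits the base and destination URI of the linking page.
linked = resolve_destination("zip://path/*@foo.zip", "something/anotherpage.html")
```

Only the path, never the base, ends up in the output URI, which is what lets a page drop the 'site' prefix.
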
I presume that, to make a VFS-based source work, you'd configure multiple protocols in cocoon.xconf to point to the same code, e.g. ftp, zip, smb, etc.

[Caching thoughts removed - return to that some time later]

---Configuring a ModifiableSource---

In all of the Sources I've looked at, I haven't seen a way to pass configuration parameters into them. For example, how might one tell an FTPSource to use passive as opposed to active FTP? Any ideas?

---Source Caching---

> Cocoon probably has a lot of code for caching sources.

There are two sides to caching: improving processing time by reducing workload, and reducing writing time by not updating pages that have not changed. The former is already managed by Cocoon. The latter would require the pipeline to report whether the serializer output was read from the cache. If so, the content isn't written.

[Nicola Ken - I didn't understand this bit of your reply:]

> Actually even the former is managed by Cocoon, I don't remember where but
> IIRC the Environment has such an info, only that in the current
> implementation of the CLI environments it's unimplemented.

As Nicola Ken pointed out, the links of every page would need to be cached, because when a page is found to be already on disk and up to date, you still need its links for crawling. Hmm.

---Threading---

Threading needs reworking, as the ThreadedDestination would become deprecated. The bean either needs to have threaded processing built in, or I need to create something like a 'threadable source', using something like threaded:ftp://blah. I don't know which yet.

There are two possible forms of threading: generation threading and dispatch threading. In generation threading, multiple pages are generated simultaneously. The benefit of this is that pages are likely to appear more synchronously at the destination.
The downside is that the processor needs to switch between multiple threads (assuming a single-processor machine).

Dispatch threading involves sequential page generation, then handing the generated content to a thread pool that dispatches it to its destination. The benefit of this is greater speed of delivery when delivery takes place over a slow network connection. This kind of threading is important for a system that I want to use the CLI for: the pages bear no relation to each other, and speed of delivery is important. (I don't plan to implement generation threading ATM.)

Threading is configured either once for an xconf file, or as part of a threaded source URL. The default would be no threading.

Final comments from Nicola Ken:

> What about a publish-subscribe model, with complete decoupling from
> the publishing and the handling?

Can you explain more what you mean by this?

> As points that are important, I would say in order:
>
> 1) make Cocoon *not* output the pages that have an error
> 2) make cocoon output xxxpagename.error.txt with the errors
>    of the 'xxxpagename' page (configurable)
> 3) make the report on broken links in XML so that it can be
>    added to the site (where to put it configurable)
> 4) make the content not regenerated if uptodate (very important
>    from a user perspective POV)
> 5) use ModifyableSource instead of Destination
> 6) others
>
> Feel free to do whatever in whatever order you prefer, this is just
> what IMVHO is the priority. 1+2 are needed BTW so that crawlers see
> broken links correctly, otherwise the site seems ok but instead the
> broken links are there.

Do you have ideas as to how to do these (i.e. 1-4)? 5 is of greatest importance to me, but if I can understand what is involved in the others, I can always have a go.

Regards, Upayavira
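
P.S. To make the dispatch threading idea above a little more concrete, here is a minimal sketch: pages are generated one at a time, and each result is handed to a thread pool for delivery. It is written in Python purely for brevity and is not Cocoon code; crawl_and_dispatch, generate and dispatch are hypothetical names standing in for whatever the Bean would actually call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def crawl_and_dispatch(
    pages: Iterable[str],
    generate: Callable[[str], bytes],
    dispatch: Callable[[str, bytes], None],
    workers: int = 4,
) -> None:
    """Generate pages sequentially; deliver them via a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = []
        for uri in pages:
            # Generation stays sequential (no generation threading).
            content = generate(uri)
            # Delivery runs in the pool, so a slow connection does not
            # hold up generation of the next page.
            futures.append(pool.submit(dispatch, uri, content))
        for future in futures:
            # Re-raise any error that occurred during delivery.
            future.result()


# Usage with stand-in callables: 'delivery' just records the content.
delivered = {}
crawl_and_dispatch(
    ["a.html", "b.html"],
    generate=lambda uri: uri.upper().encode(),
    dispatch=lambda uri, content: delivered.__setitem__(uri, content),
    workers=2,
)
```

With workers=1 this delivers pages one at a time, in order, on a single background thread. The "no threading" default described above would correspond to calling dispatch directly instead of going through the pool.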