Date: Sat, 2 Aug 2003 22:08:21 +1000
From: Jeff Turner
To: dev@cocoon.apache.org
Subject: cli.xconf questions
Message-ID: <20030802120821.GD3950@expresso.localdomain>

Hi,

I'm tinkering around with the CLI, thinking how to add
don't-crawl-this-page support, and have some questions on how cli.xconf
currently works.

The following block in cli.xconf has me confused..

| The old behaviour - appends uri to the specified destination
| directory (as specified in ):
|
| documents/index.html

Do we still want this ... behaviour? Currently the CLI only accepts .

Come to think of it, the attribute name 'src' doesn't really make sense.
What is the "source" of a Cocoon URI? It would be the XML
(documents/index.xml), which is not what we're specifying in @src.

| Append: append the generated page's URI to the end of the
| source URI:
|
|

What is a 'source URI' here, and why would we want to append another URI
(URIs are not additive)?
Does this mean documents/index.html would be written to build/dest/? If
so, why separate @src-prefix and @src?

|
| Replace: Completely ignore the generated page's URI - just
| use the destination URI:
|
|

Sounds fine, but again, since we know the whole URI
(documents/index.html), why separate into @src-prefix and @src?

|
| Insert: Insert generated page's URI into the destination
| URI at the point marked with a * (example uses fictional
| zip protocol)
|
|

Leaves me very confused.. what would be the result here? An index.zip
file, containing the bytes from documents/index.html saved as page.html?
Is there a non-fictional scenario where this makes more sense? :)

Anyway, on to the subject of excluding certain URIs.. are there any
preferred ways of doing it? I've currently got:

  ....

working, which seems crude but effective. Ideally I'd like to:

 - Use wildcards ("don't crawl '*.xml' URLs")
 - be able to exclude links based on which page they originate from
   ("ignore broken links from sitemap-ref.html")

I was thinking of some sort of nesting notation for indicating links from
a certain page:

  *.xml

Sorry I don't have any answers or even particularly coherent questions ;)
I have the feeling that cli.xconf's job, mapping URIs to the filesystem,
could potentially be quite intricate. It is roughly an inverse of what the
sitemap does. Perhaps we need an analogous syntax?

--Jeff
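
A minimal sketch, in Python rather than Cocoon's own Java, of the two
ideas raised in this message: the append/replace/insert destination modes
quoted from cli.xconf, and wildcard exclusion rules that can be scoped to
the page a link originates from. Everything here is an assumption made for
illustration: the function names resolve_destination and should_crawl, the
(from_pattern, link_pattern) rule shape, the example destination strings,
and the reading of "append" as adding the page URI to the end of the
destination. None of it is the CLI's actual API or confirmed behaviour.

```python
from fnmatch import fnmatchcase


def resolve_destination(mode, page_uri, dest):
    """Hypothetical mapping of a generated page's URI to a destination."""
    if mode == "append":
        # Assumed reading: dest is a directory-like prefix and the page
        # URI is added to the end of it.
        return dest + page_uri
    if mode == "replace":
        # The page URI is ignored entirely; dest is used as given.
        return dest
    if mode == "insert":
        # The page URI is substituted at the first '*' in dest.
        if "*" not in dest:
            raise ValueError("insert mode needs a '*' marker in dest")
        return dest.replace("*", page_uri, 1)
    raise ValueError("unknown mode: %r" % (mode,))


def should_crawl(link, from_page, rules):
    """Return False if any exclusion rule matches this link.

    rules is a list of (from_pattern, link_pattern) pairs. A from_pattern
    of None applies the rule to links found on any page.
    """
    for from_pattern, link_pattern in rules:
        if from_pattern is not None and not fnmatchcase(from_page, from_pattern):
            continue  # rule is scoped to a different originating page
        if fnmatchcase(link, link_pattern):
            return False
    return True


# Hypothetical rules mirroring the two wishes listed in the message.
EXAMPLE_RULES = [
    (None, "*.xml"),            # don't crawl '*.xml' URLs from any page
    ("sitemap-ref.html", "*"),  # ignore every link found on sitemap-ref.html
]
```

With these rules, should_crawl("documents/index.xml", "index.html",
EXAMPLE_RULES) is False, while the same call for "documents/index.html" is
True. Keeping the originating page as an optional first element of each
rule is one way to express the nesting idea without a nested syntax; a
nested XML notation could be flattened into such pairs when the
configuration is read.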