From: Upayavira
Organization: Odoko Ltd
To: dev@cocoon.apache.org
Subject: Re: A new CLI (was Re: [RT] The environment abstraction, part II)
Date: Tue, 04 Apr 2006 10:04:48 +0100
Message-ID: <443236B0.8020001@odoko.co.uk>
In-Reply-To: <1144141050.8484.41.camel@localhost>

Thorsten Scherler wrote:
> On Mon, 03-04-2006 at 12:34 +0100, Upayavira wrote:
>> Thorsten Scherler wrote:
>>> On Mon, 03-04-2006 at 09:00 +0100, Upayavira wrote:
>>>> David Crossley wrote:
>>>>> Upayavira wrote:
>>>>>> Sylvain Wallez wrote:
>>>>>>> Carsten Ziegeler wrote:
>>>>>>>> Sylvain Wallez wrote:
>>>>>>>>> Hmm... the current CLI uses Cocoon's links view to crawl the
>>>>>>>>> website. So although the new crawler can be based on
>>>>>>>>> servlets, it will assume these servlets answer to
>>>>>>>>> ?cocoon-view=links :-)
>>>>>>>>>
>>>>>>>> Hmm, I think we don't need the links view in this case
>>>>>>>> anymore. A simple HTML crawler should be enough, as it will
>>>>>>>> follow all the links on a page. The view would only make
>>>>>>>> sense in the case where you don't output HTML, where the
>>>>>>>> usual crawler tools would not work.
>>>>>>>>
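As an illustration of the "simple HTML crawler" Carsten describes,
here is a minimal sketch: breadth-first over same-host links, with a
regex standing in for a real HTML parser. The class name and the href
pattern are assumptions for illustration, not Cocoon's actual CLI
code.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.HashSet;
    import java.util.LinkedList;
    import java.util.Queue;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /** Breadth-first crawl of one host, following href links. */
    public class SimpleCrawler {
        private static final Pattern HREF = Pattern.compile(
                "href=[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

        public static void main(String[] args) throws Exception {
            URL start = new URL(args[0]);
            Set<String> seen = new HashSet<String>();
            Queue<URL> queue = new LinkedList<URL>();
            seen.add(start.toString());
            queue.add(start);
            while (!queue.isEmpty()) {
                URL page = queue.remove();
                System.out.println("fetching " + page);
                // read the page body as text
                StringBuilder body = new StringBuilder();
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(page.openStream()));
                for (String line; (line = in.readLine()) != null;) {
                    body.append(line).append('\n');
                }
                in.close();
                Matcher m = HREF.matcher(body);
                while (m.find()) {
                    String target = m.group(1);
                    // skip non-crawlable link schemes
                    if (target.startsWith("mailto:")
                            || target.startsWith("javascript:")) {
                        continue;
                    }
                    URL link = new URL(page, target); // resolve relative
                    // stay on the starting host; skip seen pages
                    if (start.getHost().equals(link.getHost())
                            && seen.add(link.toString())) {
                        queue.add(link);
                    }
                }
            }
        }
    }

Nothing in it is Cocoon-specific: any pipeline that ends in ordinary
HTML can be crawled this way, which is why the links view becomes
optional for the HTML case.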
>>>>>>> In the case of Forrest, you're probably right. Now, the links
>>>>>>> view also allows following links in pipelines producing
>>>>>>> something that's not HTML, such as PDF, SVG, WML, etc.
>>>>>>>
>>>>>>> We have to decide if we want to lose this feature.
>>>>>
>>>>> I am not sure if we use this in Forrest. If not, then we
>>>>> probably should be.
>>>>>
>>>>>> In my view, the whole idea of crawling (i.e. gathering links
>>>>>> from pages) is suboptimal anyway. For example, some sites don't
>>>>>> directly link to all pages (e.g. they are accessed via
>>>>>> javascript, or whatever), so some pages get missed.
>>>>>>
>>>>>> Were I to code a new CLI, whilst I would support crawling, I
>>>>>> would mainly configure the CLI to get the list of pages to
>>>>>> visit by calling one or more URLs. Those URLs would specify the
>>>>>> pages to generate.
>>>>>>
>>>>>> Thus, Forrest would transform its site.xml file into this list
>>>>>> of pages, and drive the CLI via that.
>>>>>
>>>>> This is what we do do. We have a property
>>>>> "start-uri=linkmap.html"
>>>>> http://forrest.zones.apache.org/ft/build/cocoon-docs/linkmap.html
>>>>> (we actually use the corresponding XML, of course).
>>>>>
>>>>> We define a few extra URIs in the Cocoon cli.xconf.
>>>>>
>>>>> There are issues, of course. Sometimes we want to include
>>>>> directories of files that are not referenced in the site.xml
>>>>> navigation. For my sites I just use a DirectoryGenerator to
>>>>> build an index page which feeds the crawler. Sometimes that
>>>>> technique is not sufficient.
>>>>>
>>>>> We also gather links from text files (e.g. CSS) using Chaperon.
>>>>> This works nicely but introduces some overhead.
>>>>
>>>> This more or less confirms my suggested approach - allow crawling
>>>> at the 'end-point' HTML, but more importantly, use a page/URL to
>>>> identify the pages to be crawled. The interesting thing from what
>>>> you say is that this page could itself be nothing more than HTML.
>>>
>>> Well, yes and not really, since e.g. Chaperon is text-based, with
>>> no markup. You need a lex-writer to generate links for the
>>> crawler.
>>
>> Yes. You misunderstand me, I think.
>
> Yes, sorry, I did misunderstand you.
>
>> Even if you use Chaperon etc. to parse markup, there'd be no
>> difficulty expressing the links that you found as an HTML page -
>> one intended to be consumed by the CLI, not to be publicly viewed.
>
> Well, in the case of CSS you want them publicly viewable as well,
> but I got your point. ;)
>
>> In fact, if it were written to disc, Forrest would probably delete
>> it afterwards.
>>
>>> Forrest is actually *not* aimed at HTML-only support, and one can
>>> imagine a situation where you want your site to be text only (a
>>> kind of book). Here you need to crawl the lex-rewriter output and
>>> follow the links.
>>
>> Hopefully I've shown that I had understood that already :-)
>
> yeah ;)
>
>>> The current limitations of Forrest regarding the crawler are IMO
>>> not caused by the crawler design but rather by our (as in Forrest)
>>> usage of it.
>>
>> Yep, fair enough. But if the CLI is going to survive the shift that
>> is happening in Cocoon trunk, something big needs to be done by
>> someone. It cannot survive in its current form, as the code it uses
>> is changing almost beyond recognition.
>>
>> Heh, perhaps the Cocoon CLI should just be a Maven plugin.
>
> ...or a Forrest plugin. ;) This would make it possible for Cocoon,
> Lenya and Forrest committers to help.
>
> Kind of like http://svn.apache.org/viewcvs.cgi/lenya/sandbox/doco/ ;)

Well, in the end, it is he who implements it that decides.

Upayavira
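For concreteness, a sketch of the URL-list-driven generation proposed
above, assuming a single seed page (such as Forrest's linkmap) links
to every page to be generated. The class name, the seed format and
the regex are illustrative assumptions, not the actual CLI:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.Writer;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /** Generate a site from one seed page enumerating its URLs. */
    public class ListDrivenGenerator {
        public static void main(String[] args) throws Exception {
            URL seed = new URL(args[0]);      // e.g. .../linkmap.html
            File destDir = new File(args[1]); // output directory
            Matcher m = Pattern.compile("href=[\"']([^\"'#]+)[\"']",
                    Pattern.CASE_INSENSITIVE).matcher(slurp(seed));
            while (m.find()) {
                URL page = new URL(seed, m.group(1));
                // naively mirror the URL path under the destination;
                // text pipelines only - binary outputs such as PDF
                // would need streams, not readers
                File out = new File(destDir, page.getPath());
                out.getParentFile().mkdirs();
                Writer w = new FileWriter(out);
                w.write(slurp(page));
                w.close();
            }
        }

        private static String slurp(URL url) throws IOException {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream()));
            StringBuilder sb = new StringBuilder();
            for (String line; (line = in.readLine()) != null;) {
                sb.append(line).append('\n');
            }
            in.close();
            return sb.toString();
        }
    }

Because the seed page is authoritative, pages reachable only via
javascript, or not linked from anywhere, still get generated; Forrest
could produce such a page from site.xml (or from Chaperon-extracted
links) and delete it afterwards, as discussed above.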