From: Upayavira
Organization: Odoko Ltd
To: dev@cocoon.apache.org
Subject: Re: A new CLI (was Re: [RT] The environment abstraction, part II)
Date: Tue, 04 Apr 2006 10:04:48 +0100
Message-ID: <443236B0.8020001@odoko.co.uk>
In-Reply-To: <1144141050.8484.41.camel@localhost>

Thorsten Scherler wrote:
> On Mon, 03-04-2006 at 12:34 +0100, Upayavira wrote:
>> Thorsten Scherler wrote:
>>> On Mon, 03-04-2006 at 09:00 +0100, Upayavira wrote:
>>>> David Crossley wrote:
>>>>> Upayavira wrote:
>>>>>> Sylvain Wallez wrote:
>>>>>>> Carsten Ziegeler wrote:
>>>>>>>> Sylvain Wallez wrote:
>>>>>>>>> Hmm... the current CLI uses Cocoon's links view to crawl the
>>>>>>>>> website. So although the new crawler can be based on
>>>>>>>>> servlets, it will assume these servlets answer to
>>>>>>>>> ?cocoon-view=links :-)
>>>>>>>>>
>>>>>>>> Hmm, I think we don't need the links view in this case
>>>>>>>> anymore. A simple HTML crawler should be enough, as it will
>>>>>>>> follow all the links on a page. The view would only make
>>>>>>>> sense in the case where you don't output HTML, where the
>>>>>>>> usual crawler tools would not work.
>>>>>>>>
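As an illustration of the "simple HTML crawler" Carsten describes,
here is a minimal sketch: breadth-first over same-host links, with a
regex standing in for a real HTML parser. The class name and the href
pattern are assumptions for illustration, not Cocoon's actual CLI
code.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.HashSet;
    import java.util.LinkedList;
    import java.util.Queue;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /** Breadth-first crawl of one host, following href links. */
    public class SimpleCrawler {
        private static final Pattern HREF = Pattern.compile(
                "href=[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

        public static void main(String[] args) throws Exception {
            URL start = new URL(args[0]);
            Set<String> seen = new HashSet<String>();
            Queue<URL> queue = new LinkedList<URL>();
            seen.add(start.toString());
            queue.add(start);
            while (!queue.isEmpty()) {
                URL page = queue.remove();
                System.out.println("fetching " + page);
                // read the page body as text
                StringBuilder body = new StringBuilder();
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(page.openStream()));
                for (String line; (line = in.readLine()) != null;) {
                    body.append(line).append('\n');
                }
                in.close();
                Matcher m = HREF.matcher(body);
                while (m.find()) {
                    String target = m.group(1);
                    // skip non-crawlable link schemes
                    if (target.startsWith("mailto:")
                            || target.startsWith("javascript:")) {
                        continue;
                    }
                    URL link = new URL(page, target); // resolve relative
                    // stay on the starting host; skip seen pages
                    if (start.getHost().equals(link.getHost())
                            && seen.add(link.toString())) {
                        queue.add(link);
                    }
                }
            }
        }
    }

Nothing in it is Cocoon-specific: any pipeline that ends in ordinary
HTML can be crawled this way, which is why the links view becomes
optional for the HTML case.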
>>>>>>> In the case of Forrest, you're probably right. Now, the links
>>>>>>> view also allows following links in pipelines producing
>>>>>>> something that's not HTML, such as PDF, SVG, WML, etc.
>>>>>>>
>>>>>>> We have to decide if we want to lose this feature.
>>>>>
>>>>> I am not sure if we use this in Forrest. If not, then we
>>>>> probably should be.
>>>>>
>>>>>> In my view, the whole idea of crawling (i.e. gathering links
>>>>>> from pages) is suboptimal anyway. For example, some sites don't
>>>>>> directly link to all pages (e.g. they are accessed via
>>>>>> javascript, or whatever), so some pages get missed.
>>>>>>
>>>>>> Were I to code a new CLI, whilst I would support crawling, I
>>>>>> would mainly configure the CLI to get the list of pages to
>>>>>> visit by calling one or more URLs. Those URLs would specify the
>>>>>> pages to generate.
>>>>>>
>>>>>> Thus, Forrest would transform its site.xml file into this list
>>>>>> of pages, and drive the CLI via that.
>>>>>
>>>>> This is what we do do. We have a property
>>>>> "start-uri=linkmap.html"
>>>>> http://forrest.zones.apache.org/ft/build/cocoon-docs/linkmap.html
>>>>> (we actually use the corresponding XML, of course).
>>>>>
>>>>> We define a few extra URIs in the Cocoon cli.xconf.
>>>>>
>>>>> There are issues, of course. Sometimes we want to include
>>>>> directories of files that are not referenced in the site.xml
>>>>> navigation. For my sites I just use a DirectoryGenerator to
>>>>> build an index page which feeds the crawler. Sometimes that
>>>>> technique is not sufficient.
>>>>>
>>>>> We also gather links from text files (e.g. CSS) using Chaperon.
>>>>> This works nicely but introduces some overhead.
>>>>
>>>> This more or less confirms my suggested approach - allow crawling
>>>> at the 'end-point' HTML, but more importantly, use a page/URL to
>>>> identify the pages to be crawled. The interesting thing from what
>>>> you say is that this page could itself be nothing more than HTML.
>>>
>>> Well, yes and not really, since e.g. Chaperon is text-based, with
>>> no markup. You need a lex-writer to generate links for the
>>> crawler.
>>
>> Yes. You misunderstand me, I think.
>
> Yes, sorry, I did misunderstand you.
>
>> Even if you use Chaperon etc. to parse markup, there'd be no
>> difficulty expressing the links that you found as an HTML page -
>> one intended to be consumed by the CLI, not to be publicly viewed.
>
> Well, in the case of CSS you want them publicly viewable as well,
> but I got your point. ;)
>
>> In fact, if it were written to disc, Forrest would probably delete
>> it afterwards.
>>
>>> Forrest is actually *not* aimed at HTML-only support, and one can
>>> imagine a situation where you want your site to be text only (a
>>> kind of book). Here you need to crawl the lex-rewriter output and
>>> follow the links.
>>
>> Hopefully I've shown that I had understood that already :-)
>
> yeah ;)
>
>>> The current limitations of Forrest regarding the crawler are IMO
>>> not caused by the crawler design but rather by our (as in Forrest)
>>> usage of it.
>>
>> Yep, fair enough. But if the CLI is going to survive the shift that
>> is happening in Cocoon trunk, something big needs to be done by
>> someone. It cannot survive in its current form, as the code it uses
>> is changing almost beyond recognition.
>>
>> Heh, perhaps the Cocoon CLI should just be a Maven plugin.
>
> ...or a Forrest plugin. ;) This would make it possible for Cocoon,
> Lenya and Forrest committers to help.
>
> Kind of like http://svn.apache.org/viewcvs.cgi/lenya/sandbox/doco/ ;)

Well, in the end, it is he who implements it that decides.

Upayavira
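For concreteness, a sketch of the URL-list-driven generation proposed
above, assuming a single seed page (such as Forrest's linkmap) links
to every page to be generated. The class name, the seed format and
the regex are illustrative assumptions, not the actual CLI:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.Writer;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /** Generate a site from one seed page enumerating its URLs. */
    public class ListDrivenGenerator {
        public static void main(String[] args) throws Exception {
            URL seed = new URL(args[0]);      // e.g. .../linkmap.html
            File destDir = new File(args[1]); // output directory
            Matcher m = Pattern.compile("href=[\"']([^\"'#]+)[\"']",
                    Pattern.CASE_INSENSITIVE).matcher(slurp(seed));
            while (m.find()) {
                URL page = new URL(seed, m.group(1));
                // naively mirror the URL path under the destination;
                // text pipelines only - binary outputs such as PDF
                // would need streams, not readers
                File out = new File(destDir, page.getPath());
                out.getParentFile().mkdirs();
                Writer w = new FileWriter(out);
                w.write(slurp(page));
                w.close();
            }
        }

        private static String slurp(URL url) throws IOException {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream()));
            StringBuilder sb = new StringBuilder();
            for (String line; (line = in.readLine()) != null;) {
                sb.append(line).append('\n');
            }
            in.close();
            return sb.toString();
        }
    }

Because the seed page is authoritative, pages reachable only via
javascript, or not linked from anywhere, still get generated; Forrest
could produce such a page from site.xml (or from Chaperon-extracted
links) and delete it afterwards, as discussed above.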