cocoon-dev mailing list archives

From: Upayavira <...@odoko.co.uk>
Subject: Re: A new CLI (was Re: [RT] The environment abstraction, part II)
Date: Tue, 04 Apr 2006 09:04:48 GMT
Thorsten Scherler wrote:
> On Mon, 03-04-2006 at 12:34 +0100, Upayavira wrote:
>> Thorsten Scherler wrote:
>>> On Mon, 03-04-2006 at 09:00 +0100, Upayavira wrote:
>>>> David Crossley wrote:
>>>>> Upayavira wrote:
>>>>>> Sylvain Wallez wrote:
>>>>>>> Carsten Ziegeler wrote:
>>>>>>>> Sylvain Wallez wrote:
>>>>>>>>> Hmm... the current CLI uses Cocoon's links view to crawl
>>>>>>>>> the website. So although the new crawler can be based on
>>>>>>>>> servlets, it will assume these servlets answer to a
>>>>>>>>> ?cocoon-view=links :-)
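For orientation: a links-view request is just the ordinary page URI with
?cocoon-view=links appended, and the response is the list of links the
pipeline found, roughly one per line. A sketch, with a made-up host and
page:

    http://localhost:8888/docs/index.html?cocoon-view=links

    guide/installing.html
    guide/faq.html
    images/logo.png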
>>>>>>>> Hmm, I think we don't need the links view in this case
>>>>>>>> anymore. A simple HTML crawler should be enough, as it will
>>>>>>>> follow all links on the page. The view would only make sense
>>>>>>>> where you don't output HTML and the usual crawler tools would
>>>>>>>> not work.
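A minimal sketch of the kind of standalone HTML crawler meant here, in
Java. Everything below is illustrative: the class name is made up, and
link extraction uses a naive regex where a real crawler would use a
proper HTML parser and honour content types:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.HashSet;
    import java.util.LinkedList;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /** Breadth-first crawl: fetch a page, pull out href values, and
     *  queue every same-site link that has not been visited yet. */
    public class SimpleHtmlCrawler {

        private static final Pattern HREF =
            Pattern.compile("href=[\"']([^\"'#]+)[\"']",
                            Pattern.CASE_INSENSITIVE);

        public static Set<String> crawl(String startUri) throws Exception {
            // Treat everything under the start page's directory as the site.
            String base = startUri.substring(0, startUri.lastIndexOf('/') + 1);
            Set<String> visited = new HashSet<String>();
            LinkedList<String> queue = new LinkedList<String>();
            queue.add(startUri);
            while (!queue.isEmpty()) {
                String uri = queue.removeFirst();
                if (!visited.add(uri)) {
                    continue;                     // already fetched
                }
                StringBuilder page = new StringBuilder();
                BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(uri).openStream(), "UTF-8"));
                try {
                    String line;
                    while ((line = in.readLine()) != null) {
                        page.append(line).append('\n');
                    }
                } finally {
                    in.close();
                }
                // Resolve each href against the current page and enqueue it.
                Matcher m = HREF.matcher(page);
                while (m.find()) {
                    String link = new URL(new URL(uri), m.group(1)).toString();
                    if (link.startsWith(base)) {
                        queue.add(link);
                    }
                }
            }
            return visited;
        }

        public static void main(String[] args) throws Exception {
            for (String uri : crawl(args[0])) {
                System.out.println(uri);
            }
        }
    }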
>>>>>>> In the case of Forrest, you're probably right. Now, the links
>>>>>>> view also allows following links in pipelines producing
>>>>>>> something that's not HTML, such as PDF, SVG, WML, etc.
>>>>>>>
>>>>>>> We have to decide if we want to lose this feature.
>>>>> I am not sure if we use this in Forrest. If not,
>>>>> then we probably should be using it.
>>>>>
>>>>>> In my view, the whole idea of crawling (i.e. gathering links
>>>>>> from pages) is suboptimal anyway. For example, some sites don't
>>>>>> directly link to all pages (e.g. they are accessed via
>>>>>> javascript, or whatever), so pages get missed.
>>>>>>
>>>>>> Were I to code a new CLI, whilst I would support crawling, I
>>>>>> would mainly configure the CLI to get the list of pages to
>>>>>> visit by calling one or more URLs. Those URLs would specify the
>>>>>> pages to generate.
>>>>>>
>>>>>> Thus, Forrest would transform its site.xml file into this list
>>>>>> of pages, and drive the CLI via that.
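To make the suggestion concrete: the URL(s) the CLI calls might return
nothing more than a flat list of page URIs to generate, one per line.
The format below is purely hypothetical; nothing in this thread fixes
one:

    index.html
    guide/installing.html
    guide/faq.html
    changes.pdf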
>>>>> This is what we already do. We have a property
>>>>> "start-uri=linkmap.html"
>>>>> http://forrest.zones.apache.org/ft/build/cocoon-docs/linkmap.html
>>>>> (we actually use the corresponding XML, of course).
>>>>>
>>>>> We define a few extra URIs in the Cocoon cli.xconf
>>>>>
>>>>> There are issues of course. Sometimes we want to
>>>>> include directories of files that are not referenced
>>>>> in site.xml navigation. For my sites I just use a
>>>>> DirectoryGenerator to build an index page which feeds
>>>>> the crawler. Sometimes that technique is not sufficient.
>>>>>
>>>>> We also gather links from text files (e.g. CSS)
>>>>> using Chaperon. This works nicely but introduces
>>>>> some overhead.
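For readers who have not met it: cli.xconf is the Cocoon CLI's XML
configuration file, and the extra URIs David mentions are listed there
alongside the start URI. A rough sketch of the kind of entry meant;
element and attribute names are approximate recollections of the
2.1-era format, not a reference:

    <uris name="forrest-extras">
      <uri src="linkmap.html"/>
      <uri src="skin/basic.css"/>
    </uris>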
>>>> This more or less confirms my suggested approach - allow crawling at the
>>>> 'end-point' HTML, but more importantly, use a page/URL to identify the
>>>> pages to be crawled. The interesting thing from what you say is that
>>>> this page could itself be nothing more than HTML.
>>> Well, yes and not really, since e.g. Chaperon is text-based, with
>>> no markup. You need a lex-writer to generate links for the crawler.
>> Yes. You misunderstand me, I think.
> 
> Yes, sorry, I did misunderstand you.
> 
>> Even if you use Chaperon etc. to parse
>> plain-text sources, there'd be no difficulty expressing the links
>> that you found as an HTML page - one intended to be consumed by the
>> CLI, not to be publicly viewed.
> 
> Well, in the case of CSS you want them publicly viewed as well, but I
> got your point. ;)
> 
>> In fact, if it were written to disk, Forrest would
>> probably delete it afterwards.
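In other words, the intermediate "links page" could be as dumb as an
HTML body full of anchors, emitted only for the CLI and discarded once
the crawl is done. Purely illustrative:

    <html><body>
      <a href="index.html">index</a>
      <a href="guide/faq.html">faq</a>
      <a href="skin/basic.css">stylesheet</a>
    </body></html>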
>>
>>> Forrest actually is *not* aimed at HTML-only support, and one can
>>> imagine a situation where you want your site to be plain text only
>>> (a kind of book). Here you need to crawl the lex-rewriter output
>>> and follow the links.
>> Hopefully I've shown that I had understood that already :-)
> 
> yeah ;)
> 
>>> The current limitations of Forrest regarding the crawler are IMO
>>> not caused by the crawler design, but rather by our (as in Forrest)
>>> usage of it.
>> Yep, fair enough. But if the CLI is going to survive the shift that is
>> happening in Cocoon trunk, something big needs to be done by someone. It
>> cannot survive in its current form as the code it uses is changing
>> almost beyond recognition.
>>
>> Heh, perhaps the Cocoon CLI should just be a Maven plugin.
> 
> ...or a Forrest plugin. ;) This would make it possible for Cocoon,
> Lenya and Forrest committers to help.
> 
> Kind of http://svn.apache.org/viewcvs.cgi/lenya/sandbox/doco/ ;)

Well, in the end, it is he who implements that decides.

Upayavira
