cocoon-dev mailing list archives

From Thorsten Scherler <thors...@apache.org>
Subject Re: A new CLI (was Re: [RT] The environment abstraction, part II)
Date Tue, 04 Apr 2006 08:57:30 GMT
On Mon, 2006-04-03 at 12:34 +0100, Upayavira wrote:
> Thorsten Scherler wrote:
> > On Mon, 2006-04-03 at 09:00 +0100, Upayavira wrote:
> >> David Crossley wrote:
> >>> Upayavira wrote:
> >>>> Sylvain Wallez wrote:
> >>>>> Carsten Ziegeler wrote:
> >>>>>> Sylvain Wallez wrote:
> >>>>>>> Hmm... the current CLI uses Cocoon's links view to crawl the
> >>>>>>> website. So although the new crawler can be based on servlets, it
> >>>>>>> will assume these servlets answer to a ?cocoon-view=links :-)
> >>>>>>>     
> >>>>>> Hmm, I think we don't need the links view in this case anymore. A
> >>>>>> simple HTML crawler should be enough, as it will follow all links
> >>>>>> on the page. The view would only make sense where you don't output
> >>>>>> HTML and the usual crawler tools would not work.
> >>>>>>
> >>>>> In the case of Forrest, you're probably right. But the links view
> >>>>> also allows following links in pipelines producing something that's
> >>>>> not HTML, such as PDF, SVG, or WML.
> >>>>>
> >>>>> We have to decide if we want to lose this feature.
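
[Editor's note: the "simple HTML crawler" idea above can be sketched as follows. This is a minimal illustration only, not the actual Cocoon CLI code; the helper names and the breadth-first loop are assumptions made for the example.]

```python
# Minimal sketch of HTML link extraction for a crawler
# (illustrative only; the real Cocoon CLI is far more involved).
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkExtractor(HTMLParser):
    """Collect href/src attribute values from tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

def extract_links(base_url, html):
    """Return absolute, fragment-free URLs found in an HTML page."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, link))[0] for link in parser.links]
```

A crawler would then repeatedly fetch each extracted URL, extract its links in turn, and skip URLs it has already visited. Note this only works when the output is HTML, which is exactly the limitation Sylvain raises for PDF, SVG, or WML pipelines.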
> >>> I am not sure if we use this in Forrest. If not,
> >>> then we probably should be.
> >>>
> >>>> In my view, the whole idea of crawling (i.e. gathering links from
> >>>> pages) is suboptimal anyway. For example, some sites don't directly
> >>>> link to all pages (e.g. they are accessed via JavaScript), so some
> >>>> pages get missed.
> >>>>
> >>>> Were I to code a new CLI, whilst I would support crawling, I would
> >>>> mainly configure the CLI to get the list of pages to visit by calling
> >>>> one or more URLs. Those URLs would specify the pages to generate.
> >>>>
> >>>> Thus, Forrest would transform its site.xml file into this list of pages,
> >>>> and drive the CLI via that.
> >>> This is what we already do. We have a property
> >>> "start-uri=linkmap.html"
> >>> http://forrest.zones.apache.org/ft/build/cocoon-docs/linkmap.html
> >>> (we actually use the corresponding XML, of course).
> >>>
> >>> We define a few extra URIs in the Cocoon cli.xconf
> >>>
> >>> There are issues, of course. Sometimes we want to
> >>> include directories of files that are not referenced
> >>> in the site.xml navigation. For my sites I just use a
> >>> DirectoryGenerator to build an index page which feeds
> >>> the crawler. Sometimes that technique is not sufficient.
> >>>
> >>> We also gather links from text files (e.g. CSS)
> >>> using Chaperon. This works nicely but introduces
> >>> some overhead.
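
[Editor's note: the setup David describes — a start URI plus a few extra URIs — lives in Cocoon's cli.xconf. The fragment below is a rough, from-memory sketch; element and attribute names should be checked against the cli.xconf shipped with the Cocoon CLI, and linkmap.html is Forrest's generated link page.]

```xml
<cocoon verbose="true">
  <!-- URIs to generate; links found on each page are followed -->
  <uris follow-links="true">
    <!-- the start URI whose links seed the crawl -->
    <uri src="linkmap.html"/>
    <!-- extra URIs not reachable from the site.xml navigation -->
    <uri src="favicon.ico"/>
  </uris>
</cocoon>
```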
> >> This more or less confirms my suggested approach - allow crawling at the
> >> 'end-point' HTML, but more importantly, use a page/URL to identify the
> >> pages to be crawled. The interesting thing from what you say is that
> >> this page could itself be nothing more than HTML.
> > 
> > Well, yes and not really, since e.g. Chaperon is text-based, not
> > markup. You need a lex-writer to generate links for the crawler.
> 
> Yes. You misunderstand me, I think.

Yes, sorry, I did misunderstand you.

>  Even if you use Chaperon etc. to parse
> markup, there'd be no difficulty expressing the links that you found as
> an HTML page - one intended to be consumed by the CLI, not to be
> publicly viewed.

Well, in the case of CSS you want them publicly viewable as well, but I
got your point. ;)

>  In fact, if it were written to disc, forrest would
> probably delete it afterwards.
> 
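
[Editor's note: the "links page consumed by the CLI" idea is easy to picture: whatever tool finds the links (Chaperon over CSS, a site.xml transform, etc.) simply emits them as a trivial, throwaway HTML page. The function below is a hypothetical sketch; its name and the page layout are made up for illustration.]

```python
def links_page(urls):
    """Render a bare-bones HTML page whose only content is the given
    links - meant to seed a crawler, not to be read by humans."""
    items = "\n".join(f'<a href="{u}">{u}</a>' for u in urls)
    return f"<html><body>\n{items}\n</body></html>"
```

The CLI would fetch this page first, follow every link on it, and the generator (Forrest, in this discussion) could delete the page once the crawl finishes.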
> > Forrest actually is *not* aimed at HTML-only support, and one can
> > imagine the situation where you want your site to be plain text only
> > (kind of a book). Here you need to crawl the lex-rewriter output and
> > follow the links.
> 
> Hopefully I've shown that I had understood that already :-)

yeah ;)

> 
> > The current limitations of Forrest regarding the crawler are IMO not
> > caused by the crawler design but rather by our (as in Forrest) usage
> > of it.
> 
> Yep, fair enough. But if the CLI is going to survive the shift that is
> happening in Cocoon trunk, something big needs to be done by someone. It
> cannot survive in its current form as the code it uses is changing
> almost beyond recognition.
> 
> Heh, perhaps the Cocoon CLI should just be a Maven plugin.

...or a Forrest plugin. ;) This would make it possible for Cocoon, Lenya,
and Forrest committers to help.

Kind of like http://svn.apache.org/viewcvs.cgi/lenya/sandbox/doco/ ;)

salu2
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)

