cocoon-dev mailing list archives

From Bernhard Huber
Subject Re: [RT] New Cocoon Site Crawler Environment
Date Tue, 17 Dec 2002 20:25:26 GMT

Nicola Ken Barozzi wrote:
> Of all these discussions, one thing sticks out: we must 
> rewrite/fix/enhance/whatever the Cocoon crawler.
> Reasons:
>  - speed
>  - correct link gathering
> but mostly
>  - speed
> Why is it so slow?
> Mostly because it generates each source three times.
> * to get the links.
> * for each link to get the mime/type.
> * to get the page itself
> To do this it uses two environments, the FileSavingEnvironment and the 
> LinkSamplingEnvironment.
>                          {~}
> I've taken a look at the crawler project in Lucene sandbox, but its 
> objectives are totally different from ours. We could in the future add a 
> plugin to it to be able to index a Cocoon site using the link view, but 
> it does indexing, not saving a site locally.
> So our option is to do the work in Cocoon.
>                          {~}
> The three calls to Cocoon can be reduced quite easily to two, by making 
> the call to the FileSavingEnvironment return both things at the same 
> time and using those. Or by caching the result as the proposed Ant task 
> in Cocoon scratchpad does.

> The problem arises with the LinkSamplingEnvironment, because it uses a 
> Cocoon view to get the links. Thus we need to ask Cocoon two things, the 
> links and the contents.
<big snip/>

Rather than asking Cocoon two things, make a Generator/Transformer
that does the two things.

I am now playing around with a SourceLinkStatusGenerator, which is like
StatusGenerator but does not request the links of a page via an http:
call, but via a processor.process() call. It does this recursively: you
ask SourceLinkStatusGenerator "give me all outbound links of
index.html", and it returns an XML document with all links of the
pages reachable from index.html.
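
A minimal sketch of the recursive idea, not the actual
SourceLinkStatusGenerator code. The class name RecursiveLinkCollector
and the linksOf function are hypothetical; linksOf stands in for asking
Cocoon internally for the links of one page.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch; not taken from the Cocoon code base. */
final class RecursiveLinkCollector {

    /**
     * Returns the start page plus every page reachable from it.
     * linksOf is an assumed stand-in for an internal request to Cocoon
     * that yields the outbound links of a single page.
     */
    static Set<String> collect(String start,
                               Function<String, List<String>> linksOf) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> todo = new ArrayDeque<>();
        todo.push(start);
        while (!todo.isEmpty()) {
            String page = todo.pop();
            if (!seen.add(page)) {
                continue; // already visited; this also stops link cycles
            }
            for (String link : linksOf.apply(page)) {
                if (!seen.contains(link)) {
                    todo.push(link);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // A tiny in-memory "site" used only for illustration.
        Map<String, List<String>> site = Map.of(
            "index.html", List.of("a.html", "b.html"),
            "a.html", List.of("index.html", "c.html"),
            "b.html", List.of(),
            "c.html", List.of());
        System.out.println(
            collect("index.html", p -> site.getOrDefault(p, List.of())));
    }
}
```

The visited set is what keeps the recursion finite when pages link back
to each other, as index.html and a.html do in the example.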

You ask Cocoon: give me the content of page index.html plus its
outbound links.

The only problem I see: if you ask Cocoon this question, you will not
get a text/html response but a text/html+application/x-cocoon-links
response, taking the index.html example from above.

Moreover, you might have to adapt the sitemap with, let's say, a
<map:match pattern="crawling">, and ask Cocoon the right question
within this map:match?

Hmm, if you rely on links, you might want a LinkTransformer that does
not throw away the page content, but harvests the links without
destroying the content.
Hmm, that would be the best: no big sitemap changes, just another
transforming step. Instead of type="xslt" src="linkstatus.xslt" you use
the new LinkAndContentTransformer step, but the content-type issue stays.
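
A rough sketch of what "harvest the links without destroying the
content" could look like. It uses the plain SAX XMLFilterImpl as a
stand-in, not Cocoon's own Transformer interface, and the names
LinkHarvestingFilter and harvest are hypothetical. It also assumes the
links are un-namespaced href attributes, and it does not address the
content-type issue.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical sketch; records href values and forwards every event. */
final class LinkHarvestingFilter extends XMLFilterImpl {

    private final List<String> links = new ArrayList<>();

    LinkHarvestingFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        // Assumption: a link is any un-namespaced "href" attribute.
        String href = atts.getValue("", "href");
        if (href != null) {
            links.add(href);
        }
        // Forward the event unchanged so the page content survives.
        super.startElement(uri, localName, qName, atts);
    }

    List<String> getLinks() {
        return links;
    }

    /** Parses xml, sends all events to downstream, returns the links. */
    static List<String> harvest(String xml, ContentHandler downstream) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            LinkHarvestingFilter filter = new LinkHarvestingFilter(reader);
            filter.setContentHandler(downstream);
            filter.parse(new InputSource(new StringReader(xml)));
            return filter.getLinks();
        } catch (ParserConfigurationException | SAXException
                 | IOException e) {
            throw new IllegalStateException("link harvesting failed", e);
        }
    }
}
```

The downstream handler sees exactly the events it would have seen
without the filter, so one pass yields both the content and the links.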

btw, thanks for starting this RT. I don't have the passion to initiate
this, but it is necessary, and I appreciate it.

bye bernhard
