cocoon-dev mailing list archives

From Vadim Gritsenko <>
Subject Re: [RT] New Cocoon Site Crawler Environment
Date Tue, 17 Dec 2002 15:33:47 GMT
Nicola Ken Barozzi wrote:

> Why is it so slow?
> Mostly because it generates each source three times.
> * to get the links. 

* to get the mime type

> * for each link 

...whose mime type is not known yet...

> to get the mime/type.
> * to get the page itself 

Note: It gets the page with all the links translated using the data 
gathered in the previous step.
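The three passes above amount to an iterative crawl loop. A minimal sketch, assuming hypothetical `get_links`, `get_mime_type`, and `render_page` stand-ins for the real Cocoon environment calls:

```python
# Sketch of the current crawler: each source is generated three times.
# get_links, get_mime_type, and render_page are hypothetical stand-ins,
# not Cocoon's actual API.

def crawl(start, get_links, get_mime_type, render_page):
    """Iteratively crawl from `start`, generating each page three times."""
    pending = [start]
    seen = set()
    type_cache = {}
    pages = {}
    while pending:
        uri = pending.pop()
        if uri in seen:
            continue
        seen.add(uri)
        links = get_links(uri)                     # pass 1: sample the links
        for link in links:                         # pass 2: mime type of each
            if link not in type_cache:             # ...link not yet known
                type_cache[link] = get_mime_type(link)
        pages[uri] = render_page(uri, type_cache)  # pass 3: page itself, with
        pending.extend(links)                      # links translated
    return pages
```

The `type_cache` reflects the correction above: the mime type is only fetched for links whose type is not known yet.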

> To do this it uses two environments, the FileSavingEnvironment and the 
> LinkSamplingEnvironment.


> The three calls to Cocoon can be reduced quite easily to two, by 
> making the call to the FileSavingEnvironment return both things at the 
> same time and using those.

Clarify: which two things?

> Or by caching the result as the proposed Ant task in Cocoon scratchpad 
> does.
> The problem arises with the LinkSamplingEnvironment, because it uses a 
> Cocoon view to get the links. Thus we need to ask Cocoon two things, 
> the links and the contents. 

We can combine the getType and getLinks calls into one; see below.

> Let's leave aside the view concept for now, and think about how to 
> sample links from a content being produced.
> We can use a LinkSamplingPipeline.
> Yes, a pipeline that introduces a connector just after the 
> "content"-tagged sitemap component and saves the links found in the 
> environment. 

Mmmm... Correction: a pipeline that introduces a LinkSamplingTransformer 
right before the serializer. You can't get links from the content view 
because it might (and likely will) have none yet. Links must be sampled 
right before the serializer, just as the links view does.
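The idea is a pass-through step sitting just before the serializer that records links as the events stream by. A minimal sketch, assuming a hypothetical SAX-like event format (not Cocoon's actual transformer interface):

```python
# Sketch of a link-sampling step placed right before the serializer:
# it forwards every event unchanged while recording the links it sees.
# The (kind, name, attrs) event tuples are a hypothetical simplification
# of SAX events, not Cocoon's real transformer API.

def link_sampling_transformer(events, sampled_links):
    """Yield events unchanged; collect href/src attribute values."""
    for event in events:
        kind, name, attrs = event
        if kind == "start":
            for attr in ("href", "src"):
                if attr in attrs:
                    sampled_links.append(attrs[attr])
        yield event
```

Because the step is transparent, the serializer downstream sees exactly the stream it would have seen anyway; the sampled links end up in the environment as a side effect.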

> Thus after the call we would have in the environment the result, the 
> type and the links, all in one call.

Type and links - yes, I agree. Content - no: we won't get correct 
content, because the links in it will not yet be translated. And the 
produced content is impossible to "re-link" afterwards, because it can 
be any binary format that supports links (MS Excel, PDF, MS Word, ...).

But there is hope to get it all at once: if the LinkSamplingTransformer 
also acts as a LinkTranslatingTransformer and calls Main back on every 
new link (recursive processing, as opposed to the iterative processing 
in the current implementation of Main). The drawback of the recursive 
approach is increased memory consumption.
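The recursive alternative can be sketched as follows; `get_links` and `render_page` are again hypothetical stand-ins, and the recursion models the transformer calling Main back on each newly discovered link:

```python
# Sketch of the recursive variant: instead of Main iterating over a
# queue, the sampling/translating step calls the crawler back on every
# new link it finds. Function names are hypothetical stand-ins.

def crawl_recursive(uri, get_links, render_page, seen=None):
    """Render `uri`, then recurse into each of its unseen links."""
    if seen is None:
        seen = set()
    if uri in seen:
        return {}
    seen.add(uri)
    pages = {uri: render_page(uri)}
    for link in get_links(uri):
        # Each recursive call keeps its caller's pipeline state alive,
        # which is where the extra memory consumption comes from.
        pages.update(crawl_recursive(link, get_links, render_page, seen))
    return pages
```

Compared to the iterative queue, every level of recursion holds a stack frame (and, in Cocoon, a pipeline) open until its subtree finishes, so memory grows with link depth rather than staying roughly constant.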


