cocoon-dev mailing list archives

From Nicola Ken Barozzi
Subject Re: [RT] New Cocoon Site Crawler Environment
Date Tue, 17 Dec 2002 16:05:28 GMT

Vadim Gritsenko wrote:
> Nicola Ken Barozzi wrote:
> ...
>> Why is it so slow?
>> Mostly because it generates each source three times.


> Note: It gets the page with all the links translated using data gathered 
> in the previous step.


> We can combine getType and getLinks calls into one, see below.
>> Let's leave aside the view concept for now, and think about how to 
>> sample links from content as it is being produced.
>> We can use a LinkSamplingPipeline.
>> Yes, a pipeline that introduces a connector just after the 
>> "content"-tagged sitemap component and saves the links found in the 
>> environment. 
> Mmmm... Correction: a pipeline that introduces a LinkSamplingTransformer 
> right before the serializer. You can't get links from the content view 
> because it might (will) have none yet. Links must be sampled right 
> before the serializer, as the links view does.

The link view can be set to kick in at any point in the pipeline; it's 
always SAX.
It's up to the sitemap editor to say which step is the semantically 
rich one. It can be the first, one in the middle, or the one right before 
the Serializer.
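
To make the "it's always SAX" point concrete, here is a minimal sketch in Python (not Cocoon's Java API) of a pass-through SAX stage that records links while forwarding every event unchanged. The class name LinkSamplingFilter, the helper sample_links and the choice of href/src as link attributes are assumptions made for illustration only. Because the stage only sees SAX events, it could sit anywhere in the chain; here it is placed right before a stand-in serializer.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator

# Hypothetical choice of attributes treated as links (an assumption,
# not the rule set used by Cocoon's link view).
LINK_ATTRS = ("href", "src")


class LinkSamplingFilter(XMLFilterBase):
    """Forward SAX events unchanged while recording link attributes."""

    def __init__(self, parent=None):
        super().__init__(parent)
        self.links = []

    def startElement(self, name, attrs):
        for attr in LINK_ATTRS:
            if attr in attrs:
                self.links.append(attrs[attr])
        # Pass the event on to the next stage untouched.
        super().startElement(name, attrs)


def sample_links(xml_text):
    """Return (serialized output, sampled links) from one pass over the input."""
    out = io.StringIO()
    sampler = LinkSamplingFilter(xml.sax.make_parser())
    # XMLGenerator stands in for the serializer at the end of the pipeline.
    sampler.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    sampler.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue(), sampler.links
```

Calling sample_links on a small page yields both the serialized result and the links from a single pass. Note that the links in that output are not translated, which is the limitation raised further down in this message.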

>> Thus after the call we would have in the environment the result, the 
>> type and the links, all in one call.
> Type and links - yes, I agree. Content - no, we won't get correct 
> content because links will not be translated in this content. And 
> produced content is impossible to "re-link" because it can be any binary 
> format supporting links (MS Excel, PDF, MS Word, ...)

Ok, you are correct.

Please add here the results we reached in our quick AIM discussion; 
I have to run now.

Thanks :-)

> But there is hope of getting it all at once - if the LinkSamplingTransformer 
> is also a LinkTranslatingTransformer and calls Main back on every new 
> link (recursive processing, as opposed to the iterative processing in 
> the current implementation of Main). The drawback of the recursive 
> approach is increased memory consumption.

NAO = not an option

It doesn't scale, you are right.
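
The iterative versus recursive trade-off can be sketched as follows. This is an illustrative Python model, not the code in Main: the SITE dictionary, crawl_iterative and crawl_recursive are hypothetical names, and a "page" is reduced to the list of links it contains. The recursive version reports how many pages were in progress at the same time, which is a rough proxy for the memory cost.

```python
from collections import deque

# Hypothetical in-memory "site": URI -> links found on that page.
SITE = {
    "index.html": ["a.html", "b.html"],
    "a.html": ["b.html", "c.html"],
    "b.html": ["index.html"],
    "c.html": [],
}


def crawl_iterative(start, get_links):
    """Visit every reachable page using an explicit queue of pending URIs."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        uri = queue.popleft()
        order.append(uri)  # the page is fully processed before the next starts
        for link in get_links(uri):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


def crawl_recursive(start, get_links):
    """Visit pages by calling back into the crawler on every new link.

    Returns the visit order and the deepest nesting reached, i.e. how many
    pages were held in progress at the same time.
    """
    seen = set()
    order = []
    max_depth = 0

    def visit(uri, depth):
        nonlocal max_depth
        seen.add(uri)
        order.append(uri)
        max_depth = max(max_depth, depth)
        for link in get_links(uri):
            if link not in seen:
                visit(link, depth + 1)  # the parent page stays in progress

    visit(start, 1)
    return order, max_depth
```

In the iterative version only URIs wait in the queue, whereas in the recursive version every page on the current link chain stays open until its descendants finish. On a site that forms a long chain of links, that nesting grows with the length of the chain.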

Nicola Ken Barozzi         
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
