forrest-dev mailing list archives

From Nicola Ken Barozzi <nicola...@apache.org>
Subject [RT] New Cocoon Site Crawler Environment
Date Tue, 17 Dec 2002 14:52:38 GMT

From all these discussions, one thing sticks out: we must
rewrite/fix/enhance/whatever the Cocoon crawler.

Reasons:

  - speed
  - correct link gathering

but mostly

  - speed

Why is it so slow?
Mostly because it generates each source three times:

* once to get the links
* once per link to get the MIME type
* once to get the page itself

To do this it uses two environments: the FileSavingEnvironment and the
LinkSamplingEnvironment.
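
Roughly, the current crawl loop looks like this. This is a condensed
sketch, not the actual code in org.apache.cocoon.Main: the helper
methods makeLinkEnvironment(), makeFileEnvironment() and getType() are
assumptions, just to show where the three generations happen.

    // Condensed sketch of the current CLI behaviour; the helper
    // names are assumptions, not real Cocoon methods.
    void crawl(Cocoon cocoon, String uri) throws Exception {

        // Generation 1: render through the LinkSamplingEnvironment,
        // which requests the "links" view and collects the hyperlinks.
        LinkSamplingEnvironment linkEnv = makeLinkEnvironment(uri);
        cocoon.process(linkEnv);
        Collection links = linkEnv.getLinks();

        // Generation 2: render each linked source again, only to
        // learn its MIME type and pick the file extension on disk.
        for (Iterator i = links.iterator(); i.hasNext();) {
            String link = (String) i.next();
            String type = getType(cocoon, link);
            // ... translate the link according to the type ...
        }

        // Generation 3: render the page once more through the
        // FileSavingEnvironment to actually write it to disk.
        FileSavingEnvironment fileEnv = makeFileEnvironment(uri);
        cocoon.process(fileEnv);
    }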


                          {~}


I've taken a look at the crawler project in the Lucene sandbox, but its
objectives are totally different from ours: it indexes a site, it does
not save it locally. We could in the future add a plugin to it that
indexes a Cocoon site using the link view, but that doesn't help us
here. So our option is to do the work in Cocoon.


                          {~}


The three calls to Cocoon can quite easily be reduced to two, either by
making the FileSavingEnvironment call return both the content and its
type at the same time, or by caching the result, as the Ant task
proposed in the Cocoon scratchpad does.
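
In code, the two-call variant could look something like this; a sketch
under the assumption that the environment keeps the rendered bytes and
their type around (getContent() is a hypothetical accessor):

    // One FileSavingEnvironment pass yields both the bytes and the
    // content type, so the type-probing generation goes away.
    FileSavingEnvironment fileEnv = makeFileEnvironment(uri);
    cocoon.process(fileEnv);
    String type = fileEnv.getContentType(); // no extra call for the type
    byte[] content = fileEnv.getContent();  // hypothetical accessor

    // Only the link-sampling call remains: each source is now
    // generated twice instead of three times.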

The problem arises with the LinkSamplingEnvironment, because it uses a
Cocoon view to get the links. Thus we need to ask Cocoon for two
things: the links and the contents.

Let's leave the view concept aside for now, and think about how to
sample links from the content while it is being produced.

We can use a LinkSamplingPipeline.
Yes, a pipeline that introduces a connector just after the
"content"-tagged sitemap component and saves the links it finds into
the environment.

Thus after a single call we would have the result, the type and the
links all available in the environment.

In essence, we are creating a non-blocking view that runs in parallel
with the main pipeline and reports its results to the environment.
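
As a sketch, the connector could be a plain SAX filter dropped into the
pipeline right after the "content"-tagged component. The class below is
illustrative, not an existing Cocoon component:

    // Illustrative connector: records anything that looks like a link
    // and forwards every SAX event untouched, so the main pipeline is
    // never blocked. Not an existing Cocoon class.
    import java.util.ArrayList;
    import java.util.List;

    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.XMLFilterImpl;

    public class LinkGatheringFilter extends XMLFilterImpl {

        private final List links = new ArrayList();

        public void startElement(String uri, String localName,
                                 String qName, Attributes atts)
                throws SAXException {
            // Sample href/src attributes on the way through.
            String link = atts.getValue("href");
            if (link == null) link = atts.getValue("src");
            if (link != null) links.add(link);
            super.startElement(uri, localName, qName, atts);
        }

        // After processing, the environment can pick these up and
        // expose them next to the content and the type.
        public List getLinks() { return links; }
    }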

This is how views are managed in the interpreted sitemap, in a transformer:


    // Check view
    if (this.views != null) {

        // Inform the pipeline that we have a branch point
        context.getProcessingPipeline().informBranchPoint();

        String cocoonView = env.getView();
        if (cocoonView != null) {

            // Get the processing node of the requested view
            ProcessingNode viewNode =
                (ProcessingNode) this.views.get(cocoonView);

            if (viewNode != null) {
                if (getLogger().isInfoEnabled()) {
                    getLogger().info("Jumping to view "
                        + cocoonView + " from transformer at "
                        + this.getLocation());
                }
                // From here on, only the view branch is processed
                return viewNode.invoke(env, context);
            }
        }
    }

    // Return false to continue normal sitemap invocation
    return false;

When a view is requested, processing effectively branches and continues
only with the view.

Wait, this means that when the CLI recreates a site it doesn't save the
views, right?
Correct: views are simply ignored by the CLI and never written to disk.
This is also due to how views are invoked in Cocoon, through a request
parameter, so they cannot be saved to disk under a correct URL.
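
For example, this is how the links view of a page is addressed, via the
cocoon-view request parameter:

    http://localhost:8080/cocoon/index.html                    the page
    http://localhost:8080/cocoon/index.html?cocoon-view=links  its links view

The query string has no file-system equivalent, so on disk both URLs
would collapse into the same index.html.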

But even if I don't save it, I may need it for internal Cocoon
processing, as is the case with the crawler.

I don't know if it's best to use a special pipeline, to cache the
views, or something else, but we need to find a solution.

Any ideas?

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)