cocoon-dev mailing list archives

From Bernhard Huber
Subject Crawler/Indexer redesign
Date Sat, 02 Feb 2002 19:07:07 GMT

As I'm not totally happy with the Crawler and Indexer component
interfaces, I want to raise some issues here:

Today CocoonCrawler exposes:
 void crawl(URL) and Iterator iterator();
crawl() sets the base URL, and iterator() delivers the URLs reachable
from the base URL, one at a time.
I have some headaches using URL objects in the command-line environment.
The only simple option is to use file: URLs, which implies storing the
crawled XML document in the filesystem. I want to avoid storing it in
the filesystem for the sake of performance.

So I was thinking of changing the interface to:
 void crawl(Source) and Iterator iterator();
that is, working with Source objects instead of URL objects.
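
To make the proposed signature concrete, here is a minimal sketch of
how it might look. Everything beyond crawl(Source) and iterator() is an
assumption made for illustration: the Source interface below is a
reduced, hypothetical stand-in and not Cocoon's actual Source API, and
SourceCrawler, MemorySource, MemoryCrawler and the explicit list of
link ids are invented names. A real crawler would extract links from
the document content; here each source simply carries its links. In
this sketch the iterator also delivers the base source itself.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, reduced stand-in for a Source abstraction (not Cocoon's real interface).
interface Source {
    String getSystemId();
    InputStream getInputStream() throws IOException;
}

// The crawler interface as proposed: Source objects instead of URL objects.
interface SourceCrawler {
    void crawl(Source base);
    Iterator<Source> iterator();
}

// Assumption: an in-memory document that lists the ids it links to.
final class MemorySource implements Source {
    private final String systemId;
    private final String xml;
    final List<String> links;

    MemorySource(String systemId, String xml, String... links) {
        this.systemId = systemId;
        this.xml = xml;
        this.links = Arrays.asList(links);
    }

    public String getSystemId() { return systemId; }

    public InputStream getInputStream() {
        // Content is served from memory; nothing is written to the filesystem.
        return new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
    }
}

// Breadth-first crawl over an in-memory store; no http: or file: access involved.
final class MemoryCrawler implements SourceCrawler {
    private final Map<String, MemorySource> store;
    private final Deque<MemorySource> pending = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    MemoryCrawler(Map<String, MemorySource> store) { this.store = store; }

    public void crawl(Source base) {
        pending.clear();
        seen.clear();
        enqueue(base.getSystemId());
    }

    // Queue a source once, and only if the store can resolve its id.
    private void enqueue(String id) {
        MemorySource s = store.get(id);
        if (s != null && seen.add(id)) {
            pending.add(s);
        }
    }

    public Iterator<Source> iterator() {
        return new Iterator<Source>() {
            public boolean hasNext() { return !pending.isEmpty(); }

            public Source next() {
                MemorySource s = pending.remove(); // throws NoSuchElementException when empty
                for (String link : s.links) {
                    enqueue(link);
                }
                return s;
            }
        };
    }
}
```

With a store holding a few MemorySource entries, calling crawl() on one
of them and then draining iterator() visits every source reachable from
it exactly once. An indexer written against Source could read each
document via getInputStream() without caring which protocol, if any,
is behind it.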

The LuceneCocoonIndexer should also change from using URL to using Source.

The main reason for this change is that crawling and indexing, as
implemented today, work only over the http: protocol.
If you want to index XML documents of the local Cocoon, or if you want
to create an index with the command-line version of Cocoon, you may not
be able to use the http: protocol.
Hence the idea of using Source.

Perhaps someone with a broader and more detailed understanding of the
Cocoon internals could help me a bit.

bye bernhard
