lucene-dev mailing list archives

From Halácsy Péter <halacsy.pe...@axelero.com>
Subject RE: Proposal for Lucene / new component
Date Sun, 03 Mar 2002 00:10:13 GMT

> -----Original Message-----
> From: Andrew C. Oliver [mailto:acoliver@apache.org]
> Sent: Tuesday, February 26, 2002 2:13 PM
> To: Lucene Developers List
> Subject: Re: Proposal for Lucene / new component
> 
> 
> Humm.  Well said.  I'm not against using Avalon.  My approach to
> software is this though:  Get a working draft.  Refactor it into something
> that will *stand the test of time* for your second or third release.  Things
> change...iterate.  Not against a super configurable masterpiece...but
> first I want to crawl and index web pages over httpd in various
> pluggable mime formats.. Once we get there...
> 

Hello,
I was abroad last week, and it took me at least 30 minutes to read the discussion about Avalon.
It's great!

Someone mentioned that Avalon is only used by Cocoon. Well, we are using Cocoon, and I'm very
happy that it is Avalon-based; I think that is the main reason for its flexibility. BTW, Cocoon
uses Lucene; please refer to http://xml.apache.org/cocoon/userdocs/generators/search-generator.html

I think if you need logging, configuration, threading, and pooling (for the crawler), and you want
to be component-based, then you need a framework, something like Avalon. It took me one day to
understand Avalon and write my first Hello World application, but it can save you a lot of time while coding.
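
To show how little of this you write yourself: here is a rough sketch of a component
that receives its logger and its configuration from the container. I am quoting the
Avalon Framework 4 interfaces from memory, so treat the exact signatures as an
assumption; the proxy-host and max-open-sockets element names are invented for
illustration.

    import org.apache.avalon.framework.configuration.Configurable;
    import org.apache.avalon.framework.configuration.Configuration;
    import org.apache.avalon.framework.configuration.ConfigurationException;
    import org.apache.avalon.framework.logger.LogEnabled;
    import org.apache.avalon.framework.logger.Logger;

    public class HttpFetcher implements LogEnabled, Configurable {
        private Logger logger;
        private String proxyHost;
        private int maxOpenSockets;

        // The container hands us a logger; the component never creates one.
        public void enableLogging(Logger logger) {
            this.logger = logger;
        }

        // The container hands us our own slice of the config file.
        public void configure(Configuration conf) throws ConfigurationException {
            proxyHost = conf.getChild("proxy-host").getValue(null);
            maxOpenSockets = conf.getChild("max-open-sockets").getValueAsInteger(10);
            logger.info("fetcher configured, max open sockets = " + maxOpenSockets);
        }
    }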

Iteration is a very good practice in software development, and it can be applied to Avalon-based
applications as well. At first you should write only interfaces. For the first iteration you can
implement a fake component that behaves like the real one; after a while you can swap in the real,
working component just by rewriting the config file.
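
As a minimal plain-Java sketch of that style (Fetcher and FakeFetcher are
hypothetical names, not existing Lucene or Avalon code):

    // The interface is written first; everything else codes against it.
    public interface Fetcher {
        /** Download the page behind the URL as-is; throw on network failure. */
        byte[] fetch(String url) throws java.io.IOException;
    }

    // First iteration: a fake that behaves like the real component, so the
    // rest of the crawler can be written and tested before any real HTTP
    // code exists.  (A separate file in practice; shown together for brevity.)
    class FakeFetcher implements Fetcher {
        public byte[] fetch(String url) {
            return ("<html><body>stub page for " + url + "</body></html>").getBytes();
        }
    }

Once a real HTTP implementation exists, the container instantiates it instead of
the fake, because the config file names the implementation class; none of the
calling code changes.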

For example, I think the HTTP crawler is built from more than one component (a rough interface sketch follows the list):
1. the fetcher, which connects to the web server and gets the page from the URL
responsible for: downloading the page as-is (handling network errors), handling HTTP status
codes (for example, redirects)

configurable by: proxy server, max open sockets

2. a component that parses the fetched page and extracts relevant metadata

3. a component that is an interface to the loader; it gets the fetched and parsed pages from
the parser (or gets a command from the fetcher to delete pages from the search database)

this interface can be implemented by several components:
one that puts the data in files (if the loader and the search db are on another box)
one that hands the data to the loader component (which is in the same JVM)
and so on
 
4. one that feeds URLs into the crawler's database
responsible for:
extracting links from the downloaded pages
handling manually submitted URLs (submitted by users or sysadmins)
filtering out the excluded URLs

configurable by: exclusion rules

5. one that reads URLs from the database and feeds them to the fetcher
the most sophisticated component, responsible for choosing the right URL to crawl:
 - it can use a priority list based on URL patterns
 - it must not fetch too many pages from the same server (e.g. max 1 request/min)
 - it must honor the robots.txt file
configurable by: priority lists, max URLs per host

6. and the last component is the database itself; it can be a JDBC-compliant database or something
file-system based
responsible for: adding/deleting URLs to/from the database (per URL: last fetched date, last HTTP
status code, last action [add or delete])
answering host-related questions: how many URLs were fetched from the host, when the last URL
was fetched, the robots.txt of the host
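
As a sketch only, the interfaces for components 2-6 might look like this
(component 1 is the Fetcher sketched earlier; every name and signature here is
hypothetical, not an agreed design):

    import java.util.Date;
    import java.util.List;

    class ParsedPage {                 // placeholder value object
        String url, title, text;
        List<String> links;
    }

    interface PageParser {             // component 2
        ParsedPage parse(String url, byte[] rawPage);
    }

    interface LoaderGateway {          // component 3: file-based, same-JVM, ...
        void add(ParsedPage page);
        void delete(String url);
    }

    interface UrlFeeder {              // component 4: links, submissions, exclusions
        void submit(List<String> urls);
    }

    interface Scheduler {              // component 5: priorities, politeness, robots.txt
        String nextUrlToFetch();
    }

    interface UrlStore {               // component 6
        void recordFetch(String url, Date when, int httpStatus);
        void delete(String url);
        int urlCountForHost(String host);
        Date lastFetchFromHost(String host);
        String robotsTxtForHost(String host);
    }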

I know this is not a model of a working HTTP crawler, but please notice:
1. using Avalon you can change the implementation of a component in 30 seconds (if someone
has implemented it ;); see the factory sketch after this list
2. you don't have to work on implementing logging, a configuration system, or database pooling
for JDBC
3. the crawler is a component that needs no information about the search database (and the
loader/indexer doesn't know about the crawler)
4. the parser and the loader-interface components can be reused in a file-based HTML crawler (one
that reads static HTML pages from the web server's directories, if the engine is used on an intranet)
5. having different loader components, you can build a search engine for a single JVM or for a
distributed system (and you do not need to implement both in the first iteration cycle)
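
To illustrate point 1: the concrete class is named in the configuration, so
swapping FakeFetcher for HttpFetcher is a config edit, not a code change. A real
Avalon container does this for you through its role/config files, as far as I
recall; the hand-rolled factory below (hypothetical, reusing the Fetcher
interface from the earlier sketch) only shows the principle.

    import java.util.Properties;

    public class ComponentFactory {
        public static Fetcher createFetcher(Properties conf) throws Exception {
            // e.g. fetcher.class=FakeFetcher  ->  fetcher.class=HttpFetcher
            String impl = conf.getProperty("fetcher.class", "FakeFetcher");
            return (Fetcher) Class.forName(impl).getDeclaredConstructor().newInstance();
        }
    }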

OK, this mail is already too long and I'm tired.

peter


