lucene-java-user mailing list archives

From Leo Galambos
Subject High Capacity (Distributed) Crawler
Date Tue, 22 Apr 2003 08:01:10 GMT

I would like to write a High Capacity (Distributed) Crawler (HCDC), because
LARM lacks many of the options that web/HTTP crawling requires, IMHO. Here
is my list:

1. Control over what gets gathered first, based on PageRank, number of
errors, connection speed, etc.
2. A pure Java solution without any DBMS/JDBC
3. Better configuration options for error cases
4. NIO-style I/O, as suggested by the LARM specification
5. egothor's filters for automatic processing of various data formats
6. Handling of "Expires" HTTP/meta headers, plus heuristic rules that
estimate how fast a page expires (.php often expires faster than .html)
7. Reindexing without any data exports from the full-text index
8. An open protocol between the crawler and the full-text engine
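
To make items 1 and 6 a bit more concrete, here is a minimal Java sketch,
not a design. It assumes a priority queue ordered by a made-up score
(PageRank, error count, connection speed) and a revisit interval that falls
back to the file extension when no "Expires" value is known. Every name
(CrawlFrontierSketch, Candidate, revisitAfterHours) and every weight or
interval is a hypothetical placeholder; none of it comes from LARM, egothor,
or Lucene.

```java
import java.util.PriorityQueue;

// Hypothetical sketch only: names, weights, and intervals are assumptions.
final class CrawlFrontierSketch {

    /** One URL waiting to be fetched, with the signals named in item 1. */
    static final class Candidate {
        final String url;
        final double pageRank;     // assumed to be normalised to [0, 1]
        final int errorCount;      // failed fetch attempts so far
        final double kbPerSecond;  // last observed speed of the host

        Candidate(String url, double pageRank, int errorCount, double kbPerSecond) {
            this.url = url;
            this.pageRank = pageRank;
            this.errorCount = errorCount;
            this.kbPerSecond = kbPerSecond;
        }

        /** Higher score means "fetch sooner"; the weights are arbitrary. */
        double score() {
            double speedBonus = Math.min(kbPerSecond / 1000.0, 1.0);
            return 10.0 * pageRank + speedBonus - 2.0 * errorCount;
        }
    }

    // Highest score first, so the comparator compares b against a.
    private final PriorityQueue<Candidate> queue =
            new PriorityQueue<>((a, b) -> Double.compare(b.score(), a.score()));

    void add(Candidate c) {
        queue.add(c);
    }

    /** Returns the best remaining candidate, or null when the queue is empty. */
    Candidate next() {
        return queue.poll();
    }

    /**
     * Item 6: hours to wait before refetching a page. A non-negative
     * expiresHours (taken from an "Expires" header by the caller) wins;
     * a negative value means "unknown" and triggers the extension heuristic.
     */
    static long revisitAfterHours(String url, long expiresHours) {
        if (expiresHours >= 0) {
            return expiresHours;
        }
        int q = url.indexOf('?');
        String path = (q >= 0 ? url.substring(0, q) : url).toLowerCase();
        if (path.endsWith(".php")) {
            return 6;   // dynamic pages are assumed to go stale quickly
        }
        if (path.endsWith(".html") || path.endsWith(".htm")) {
            return 72;  // static pages are assumed to change rarely
        }
        return 24;      // fallback for everything else
    }

    public static void main(String[] args) {
        CrawlFrontierSketch frontier = new CrawlFrontierSketch();
        frontier.add(new Candidate("http://c.example/", 0.2, 0, 1500.0));
        frontier.add(new Candidate("http://a.example/", 0.9, 0, 200.0));
        frontier.add(new Candidate("http://b.example/", 0.9, 3, 800.0));
        for (Candidate c = frontier.next(); c != null; c = frontier.next()) {
            System.out.println(c.url + " revisit in "
                    + revisitAfterHours(c.url, -1) + "h");
        }
    }
}
```

A distributed version would need a shared or partitioned queue and
per-host politeness limits, which this sketch leaves out on purpose.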

If anyone wants to join (or just extend the wish list), let me know, please.

