lucene-java-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: High Capacity (Distributed) Crawler
Date Mon, 09 Jun 2003 19:44:47 GMT
Leo,

Have you started this project?  Where is it hosted?
It would be nice to see a few alternative implementations of a robust
and scalable Java web crawler with the ability to index whatever it
fetches.

Thanks,
Otis

--- Leo Galambos <Leo.G@seznam.cz> wrote:
> Hi.
> 
> I would like to write a High Capacity Distributed Crawler (HCDC),
> because IMHO LARM lacks many options that web/HTTP crawling requires.
> Here is my list:
> 
> 1. I would like to control what gets fetched first, based on PageRank,
> number of errors, connection speed, etc.
> 2. pure Java solution without any DBMS/JDBC
> 3. better configuration in case of an error
> 4. NIO style as it is suggested by LARM specification
> 5. egothor's filters for automatic processing of various data formats
> 6. management of "Expires" HTTP-meta headers, heuristic rules which will
> describe how fast a page can expire (.php often expires faster than
> .html)
> 7. reindexing without any data exports from a full-text index
> 8. open protocol between the crawler and a full-text engine
> 
> If anyone wants to join (or just extend the wish list), let me know,
> please.
> 
> -g-
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
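Item 1 in Leo's list (deciding what gets fetched first) could be sketched
as a scored priority queue. This is only an illustration in modern Java,
not anything taken from LARM or from Leo's project: the class name
Frontier, the Candidate fields, and the scoring formula are all
assumptions made up for the example.

```java
import java.util.PriorityQueue;

class Frontier {
    /** One URL waiting to be fetched, with the signals named in item 1. */
    private static final class Candidate {
        final String url;
        final double rank;     // link-based importance, higher is better
        final int errors;      // failed fetch attempts so far
        final long latencyMs;  // last observed response time of the host

        Candidate(String url, double rank, int errors, long latencyMs) {
            this.url = url;
            this.rank = rank;
            this.errors = errors;
            this.latencyMs = latencyMs;
        }

        // Hypothetical scoring: reward rank, penalise errors and slow hosts.
        double score() {
            return rank / ((1.0 + errors) * (1.0 + latencyMs / 1000.0));
        }
    }

    // Orders candidates so that the highest score is polled first.
    private final PriorityQueue<Candidate> queue =
            new PriorityQueue<>((a, b) -> Double.compare(b.score(), a.score()));

    void add(String url, double rank, int errors, long latencyMs) {
        queue.add(new Candidate(url, rank, errors, latencyMs));
    }

    /** Returns the next URL to fetch, or null when the frontier is empty. */
    String next() {
        Candidate c = queue.poll();
        return c == null ? null : c.url;
    }

    boolean isEmpty() {
        return queue.isEmpty();
    }

    public static void main(String[] args) {
        Frontier f = new Frontier();
        f.add("http://slow.example/", 0.9, 3, 4000);
        f.add("http://fast.example/", 0.5, 0, 100);
        System.out.println(f.next());
    }
}
```

With these made-up numbers the fast, error-free host is returned first
even though its rank is lower. A real crawler would also need per-host
politeness limits, which this sketch leaves out.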

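Item 6 (honouring "Expires" and falling back to heuristics) might look
roughly like the sketch below. The ExpiryPolicy class, the nextFetch
method, and both TTL values are hypothetical; the only idea taken from
the list is that .php pages tend to expire faster than .html pages.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;

class ExpiryPolicy {
    // Assumed defaults; real values would come from crawl statistics.
    static final Duration DYNAMIC_TTL = Duration.ofHours(6);
    static final Duration STATIC_TTL = Duration.ofDays(7);

    /**
     * Decides when a page should be fetched again.
     *
     * @param path          URL path without the query string, e.g. "/index.php"
     * @param expiresHeader value of the Expires header, or null if absent
     * @param fetchedAt     time of the fetch that produced the header
     */
    static Instant nextFetch(String path, String expiresHeader, Instant fetchedAt) {
        if (expiresHeader != null) {
            try {
                Instant expires = ZonedDateTime
                        .parse(expiresHeader, DateTimeFormatter.RFC_1123_DATE_TIME)
                        .toInstant();
                // Trust the server only when the date lies in the future.
                if (expires.isAfter(fetchedAt)) {
                    return expires;
                }
            } catch (DateTimeParseException e) {
                // Malformed header: fall through to the heuristic.
            }
        }
        // Heuristic from item 6: dynamic pages are revisited sooner.
        boolean dynamic = path.toLowerCase(Locale.ROOT).endsWith(".php");
        return fetchedAt.plus(dynamic ? DYNAMIC_TTL : STATIC_TTL);
    }
}
```

A missing, malformed, or already-past Expires value falls back to the
extension rule here; a fuller version could learn the TTLs from how often
each page was actually seen to change.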
