lucene-java-user mailing list archives

From Leo Galambos <Le...@seznam.cz>
Subject Re: High Capacity (Distributed) Crawler
Date Mon, 09 Jun 2003 21:56:20 GMT
Hi Otis.

The first beta is done (without NIO). It still needs further testing,
however. Unfortunately, I could not find enough servers that I am
allowed to hit.

I wanted to commit the robot as a part of egothor (which will use it in
PULL mode), but we have nice weather here, so I lost all motivation to
play with the PC ;-).

What interface do you need for Lucene? Will you use PUSH mode (the robot
modifies Lucene's index) or PULL mode (the engine gets deltas from the
robot)? Tell me what you need and I will do my best.
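
To make the PUSH/PULL question concrete, here is a minimal sketch in
Java of how the two modes might look as interfaces. All names (Delta,
IndexSink, DeltaSource, InMemoryRobot) are hypothetical and are not part
of egothor, LARM or Lucene; the in-memory change log is only an
assumption to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;

public class CrawlerModesSketch {

    /** One change reported by the robot (hypothetical type). */
    static final class Delta {
        final String url;
        final String content; // null means the page should be removed

        Delta(String url, String content) {
            this.url = url;
            this.content = content;
        }

        boolean isDelete() {
            return content == null;
        }
    }

    /** PUSH mode: the robot calls the sink, which modifies the index. */
    interface IndexSink {
        void apply(Delta delta);
    }

    /** PULL mode: the engine asks the robot for deltas it has not seen yet. */
    interface DeltaSource {
        List<Delta> deltasSince(int sequence);

        int currentSequence();
    }

    /** Toy robot that serves both modes from one in-memory change log. */
    static final class InMemoryRobot implements DeltaSource {
        private final List<Delta> log = new ArrayList<>();
        private final List<IndexSink> sinks = new ArrayList<>();

        void addSink(IndexSink sink) {
            sinks.add(sink);
        }

        /** Records a change and pushes it to every registered sink. */
        void report(Delta delta) {
            log.add(delta);
            for (IndexSink sink : sinks) {
                sink.apply(delta);
            }
        }

        @Override
        public List<Delta> deltasSince(int sequence) {
            return new ArrayList<>(log.subList(sequence, log.size()));
        }

        @Override
        public int currentSequence() {
            return log.size();
        }
    }

    public static void main(String[] args) {
        InMemoryRobot robot = new InMemoryRobot();

        // PUSH: the sink is notified as soon as the robot reports a change.
        List<String> pushed = new ArrayList<>();
        robot.addSink(d -> pushed.add(d.url));
        robot.report(new Delta("http://example.org/a", "hello"));

        // PULL: the engine remembers a sequence number and asks later.
        int seen = robot.currentSequence();
        robot.report(new Delta("http://example.org/b", null));

        System.out.println("pushed urls: " + pushed);
        System.out.println("deltas pulled since " + seen + ": "
                + robot.deltasSince(seen).size());
    }
}
```

In PUSH mode the robot needs write access to the index, while in PULL
mode the engine only needs to remember the last sequence number it saw.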

-g-


Otis Gospodnetic wrote:

>Leo,
>
>Have you started this project?  Where is it hosted?
>It would be nice to see a few alternative implementations of a robust
>and scalable java web crawler with the ability to index whatever it
>fetches.
>
>Thanks,
>Otis
>
>--- Leo Galambos <Leo.G@seznam.cz> wrote:
>  
>
>>Hi.
>>
>>I would like to write $SUBJ (HCDC), because LARM does not offer many
>>options that web/HTTP crawling requires, IMHO. Here is my list:
>>
>>1. I would like to control what is gathered first, based on PageRank,
>>number of errors, connection speed, etc.
>>2. pure JAVA solution without any DBMS/JDBC
>>3. better configuration in case of an error
>>4. NIO style, as suggested by the LARM specification
>>5. egothor's filters for automatic processing of various data formats
>>6. management of "Expires" HTTP-meta headers, heuristic rules which
>>will describe how fast a page can expire (.php often expires faster
>>than .html)
>>7. reindexing without any data exports from a full-text index
>>8. open protocol between the crawler and a full-text engine
>>
>>If anyone wants to join (or just extend the wish list), please let me
>>know.
>>
>>-g-
>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>
>>    
>>
>
>
>
>
>  
>
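
The first wish-list item in the quoted message (deciding what is
gathered first from PageRank, number of errors and connection speed)
could be sketched as a priority queue. This is only an illustration:
the class names, the fields and especially the scoring weights are
assumptions made up for the example, not anything taken from HCDC,
LARM or egothor.

```java
import java.util.PriorityQueue;

public class FrontierSketch {

    /** A URL waiting to be fetched, with the signals named in the wish list. */
    static final class Candidate {
        final String url;
        final double pageRank;    // higher means more important
        final int errorCount;     // earlier failed fetch attempts
        final double kbPerSecond; // observed connection speed of the host

        Candidate(String url, double pageRank, int errorCount, double kbPerSecond) {
            this.url = url;
            this.pageRank = pageRank;
            this.errorCount = errorCount;
            this.kbPerSecond = kbPerSecond;
        }

        /** Hypothetical score: the weights are arbitrary, chosen for the demo. */
        double score() {
            double speedBonus = Math.min(kbPerSecond, 100.0) / 100.0;
            return pageRank * 10.0 + speedBonus - errorCount * 2.0;
        }
    }

    /** Crawl frontier that always hands out the best-scoring candidate. */
    static final class Frontier {
        private final PriorityQueue<Candidate> queue =
                new PriorityQueue<>((a, b) -> Double.compare(b.score(), a.score()));

        void add(Candidate candidate) {
            queue.add(candidate);
        }

        /** Returns the next candidate to fetch, or null if none is left. */
        Candidate next() {
            return queue.poll();
        }

        boolean isEmpty() {
            return queue.isEmpty();
        }
    }

    public static void main(String[] args) {
        Frontier frontier = new Frontier();
        frontier.add(new Candidate("http://example.org/slow", 0.2, 0, 200.0));
        frontier.add(new Candidate("http://example.org/flaky", 0.9, 3, 50.0));
        frontier.add(new Candidate("http://example.org/good", 0.9, 0, 50.0));

        while (!frontier.isEmpty()) {
            Candidate c = frontier.next();
            System.out.println(c.url + " score=" + c.score());
        }
    }
}
```

Keeping the scoring in one method means the ordering policy can be
changed without touching the queue itself.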





