lucene-dev mailing list archives

From "none none" <>
Subject Re: Web Crawler
Date Wed, 24 Apr 2002 20:38:07 GMT
Sounds good!
I think it will be very useful.
I am writing a crawler too, but it is not complete or multithreaded yet.
My crawler will run on its own machine because of the bandwidth usage, high CPU usage, and disk I/O.
It will also run as an NT service, and I'll manage it from a remote machine over RMI.
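Just to give an idea, the management interface I have in mind would look roughly like this (the names here are only illustrative, nothing is final):

import java.rmi.Remote;
import java.rmi.RemoteException;

// Remote interface the crawler exports so it can be managed over RMI.
// The implementation would extend UnicastRemoteObject and be bound in
// the RMI registry with Naming.rebind().
public interface CrawlerControl extends Remote {
    void start() throws RemoteException;
    void stop() throws RemoteException;
    int getQueueSize() throws RemoteException;
    int getVisitedCount() throws RemoteException;
}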

I can tell you in advance that keeping all the visited links in memory will kill your machine
after about 150,000 links. I tested it: after crawling 200,000 links the CPU was at 100%,
no response to events, nothing. The best part? It wasn't even doing real work, because all
the time was wasted searching the array to see whether it already contained the current URL,
just to make the enqueue/ignore decision! Same thing for inserting or deleting a link from the queue!
I think a database approach would be good for that.
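To show what I mean (just a rough sketch, the class and method names are made up): with a plain array or List, every contains() check scans all stored URLs, so each new link costs time proportional to the number of links already seen. A hashed structure makes that lookup constant time:

import java.util.HashSet;
import java.util.Set;

// Tracks visited URLs so the enqueue/ignore decision is a constant-time
// hash lookup instead of a linear scan over an ever-growing array.
public class VisitedUrls {
    private final Set<String> visited = new HashSet<String>();

    // Returns true if the URL was not seen before and should be enqueued.
    public boolean markVisited(String url) {
        return visited.add(url);  // add() returns false if already present
    }
}

That only fixes the CPU problem, though; the set still grows with the crawl, which is why I think a database (or at least an on-disk store) is the way to go for really large crawls.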



On Wed, 24 Apr 2002 21:47:25, Clemens Marschner wrote:
>I have been writing a web crawler in Java for quite some time now. Since
>Lucene doesn't contain one by itself, I wonder if you were interested in a
>contribution within the Lucene project.
>I would probably call it a 0.4. It has quite a modular design, it's
>multithreaded and still pretty simple.
>And it's optimized for speed. I spent some time with a profiler to get the
>beast FAST and memory consumption low. It contains an optimized HTML parser
>that just extracts the necessary information and doesn't waste time nor
>I was able to get a maximum of 3.7 MB/sec on a 100MBit line and a MAN-style
>network (a University campus with about 150 web servers).
>Its only purpose is to crawl documents and links and store them somewhere.
>Nothing is done with the documents (though it would be easy to incorporate
>any computation steps, but this would probably shift the balance between IO
>and CPU usage until one of them becomes a bottleneck). Any connection to the
>Lucene engine has yet to be provided.
>I have also made a lot of optimizations on RAM usage, but still some data
>structures are kept in main memory (notably the hash of visited URLs),
>limiting the number of files that can be crawled.
>Since it's not a production release yet, it still has some limitations. Some
>work still has to be done, I still have a lot of ideas, and pretty much of
>the configuration is still made in the Java source code (well, at least,
>most of it is concentrated in the main() method). Since I just used it for
>myself, this was fine so far.
>Clemens Marschner
