lucene-dev mailing list archives

From "none none" <kor...@lycos.com>
Subject Re: Web Crawler
Date Wed, 24 Apr 2002 20:38:07 GMT
Sounds good!
I think it will be very useful.
I am writing a crawler too, but it is not complete or multithreaded yet.
My crawler is meant to run on your own machine, because of the bandwidth usage, high CPU usage, disk I/O, etc.
It will also run as an NT service, and I'll access it over RMI to manage it from a remote machine.
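Something like this is the kind of management interface I have in mind (just a sketch; the names are made up, not my actual code):

    import java.rmi.Remote;
    import java.rmi.RemoteException;

    // Hypothetical remote-management interface: the admin client looks
    // this up in the RMI registry and calls it across the network.
    public interface CrawlerControl extends Remote {
        void start() throws RemoteException;
        void stop() throws RemoteException;
        int queuedLinks() throws RemoteException;
    }

The implementation class would extend UnicastRemoteObject and register itself with Naming.rebind(), so the remote admin tool only ever sees the interface.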

I can tell you in advance that keeping all the visited links in memory will kill your machine after about 150,000 links. I tested that: I crawled amazon.com, and after 200,000 links the CPU was at 100%, no response to events, nothing. The best part? It was not even making progress, because all the time was wasted searching the array for the current URL to make the enqueue/ignore decision. The same goes for inserting or deleting a link in the queue!
Nice.
I think a database approach would be good for that.
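For what it's worth, a hash-based set already fixes the CPU side of this (a sketch only, not code from either crawler):

    import java.util.HashSet;
    import java.util.Set;

    // Sketch: with a HashSet the membership test is O(1) on average,
    // where a linear array scan is O(n) per URL, i.e. O(n^2)
    // comparisons over a whole crawl.
    public class VisitedSet {
        private final Set<String> visited = new HashSet<String>();

        // Returns true if the URL is new and should be enqueued;
        // add() returns false when the URL was already present.
        public boolean shouldEnqueue(String url) {
            return visited.add(url);
        }
    }

The memory problem stays, though, since every URL string remains on the heap; that part is what the database would solve.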

Bye

--

On Wed, 24 Apr 2002 21:47:25, Clemens Marschner wrote:
>Hi,
>
>I have been writing a web crawler in Java for quite some time now. Since
>Lucene doesn't contain one by itself, I wonder if you were interested in a
>contribution within the Lucene project.
>
>I would probably call it a 0.4. It has quite a modular design, it's
>multithreaded and still pretty simple.
>
>And it's optimized for speed. I spent some time with a profiler to get the
>beast FAST and memory consumption low. It contains an optimized HTML parser
>that extracts just the necessary information and doesn't waste time or
>objects.
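
(For illustration only, not Clemens's actual parser: the usual trick is to scan the character stream for links directly instead of building a parse tree, so no node objects are allocated per tag. A minimal sketch, with made-up names:)

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustrative sketch: pull href values straight out of the
    // character stream instead of building a DOM.
    public class LinkExtractor {
        private static final Pattern HREF = Pattern.compile(
                "<a\\s[^>]*href\\s*=\\s*[\"']([^\"'>]+)",
                Pattern.CASE_INSENSITIVE);

        public static List<String> extractLinks(CharSequence html) {
            List<String> links = new ArrayList<String>();
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                links.add(m.group(1));
            }
            return links;
        }
    }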
>
>I was able to get a maximum of 3.7 MB/sec on a 100MBit line and a MAN-style
>network (a University campus with about 150 web servers).
>
>Its only purpose is to crawl documents and links and store them somewhere.
>Nothing is done with the documents (it would be easy to incorporate
>computation steps, though that would probably shift the balance between I/O
>and CPU usage until one of them becomes a bottleneck). Any connection to the
>Lucene engine has yet to be provided.
>
>I have also made a lot of optimizations on RAM usage, but still some data
>structures are kept in main memory (notably the hash of visited URLs),
>limiting the number of files that can be crawled.
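
(One well-known way to loosen that memory limit, sketched here as a standard technique rather than anything from this crawler: a Bloom filter stores the visited set in a fixed-size bit array, at the price of a small false-positive rate, i.e. a few unvisited URLs may be wrongly skipped, never the reverse.)

    import java.util.BitSet;

    // Sketch of a Bloom-filter visited set: memory is fixed at
    // sizeInBits no matter how many URLs are seen; markIfNew() can
    // wrongly report a URL as already visited, never the opposite.
    public class BloomVisitedSet {
        private final BitSet bits;
        private final int size;
        private final int hashes;

        public BloomVisitedSet(int sizeInBits, int numHashes) {
            this.bits = new BitSet(sizeInBits);
            this.size = sizeInBits;
            this.hashes = numHashes;
        }

        // Derive the i-th bit position by double hashing.
        private int index(String url, int i) {
            int h1 = url.hashCode();
            int h2 = (h1 >>> 16) | 1;   // force odd, non-zero
            int idx = (h1 + i * h2) % size;
            return idx < 0 ? idx + size : idx;
        }

        // Returns true if at least one bit was unset, i.e. the URL is
        // definitely new; sets all of its bits as a side effect.
        public boolean markIfNew(String url) {
            boolean isNew = false;
            for (int i = 0; i < hashes; i++) {
                int idx = index(url, i);
                if (!bits.get(idx)) {
                    isNew = true;
                    bits.set(idx);
                }
            }
            return isNew;
        }
    }

(With roughly 10 bits per expected URL and 7 hash functions, the false-positive rate stays under 1%.)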
>
>Since it's not a production release yet, it still has some limitations. Some
>work remains to be done, I still have a lot of ideas, and much of the
>configuration is still done in the Java source code (well, at least most of
>it is concentrated in the main() method). Since I have only used it myself,
>this has been fine so far.
>
>Cheers,
>
>Clemens Marschner


--
To unsubscribe, e-mail:   <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>

