lucene-dev mailing list archives

From "Clemens Marschner" <c...@lanlab.de>
Subject Re: Web Crawler
Date Wed, 24 Apr 2002 21:13:32 GMT
> I can tell you in advance that having all the visited links in memory will
kill your machine after about 150'000 links. I tested that: I crawled
amazon.com, and after 200'000 links the CPU was at 100%, with no response to
events, nothing. The best thing?
<

It's not that bad with my crawler. I crawled 600,000 docs recently, and from
what I recall the memory usage was somewhere between 150 and 200 MB. But
that's still too much.

>
...it was not even working, because all the time was wasted searching the
array to see whether it already contained the current URL, just to make the
enqueue/ignore decision! The same goes for inserting or deleting a link from
the queue!
<

Sounds like you aren't using a HashMap?
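
To illustrate the difference: a minimal sketch of a hash-based "already seen" check, written in present-day Java. The class name Frontier and its methods are made up for this example and are not taken from any existing crawler. A HashSet lookup takes roughly constant time, whereas scanning an array costs time proportional to the number of URLs collected so far.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical class for illustration only.
class Frontier {
    private final Set<String> seen = new HashSet<>();      // every URL ever offered
    private final Queue<String> queue = new ArrayDeque<>(); // URLs still to be fetched

    /** Enqueues the URL only if it was never offered before; returns true if enqueued. */
    boolean offer(String url) {
        if (!seen.add(url)) {
            return false; // Set.add returns false when the element is already present
        }
        queue.add(url);
        return true;
    }

    /** Returns the next URL to fetch, or null if the queue is empty. */
    String next() {
        return queue.poll();
    }

    int seenCount() {
        return seen.size();
    }
}
```

This fixes the lookup time but not the memory footprint: every URL string still lives in RAM.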

The queue is easier; I wrote a caching queue that only holds small blocks of
the queue(s) in RAM and keeps most of it in files.
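
A rough sketch of what such a block-wise queue could look like, in present-day Java. This is not the actual crawler code; the class name DiskBackedQueue, the block size parameter and the one-file-per-block layout are assumptions made for the example. Only the block being read and the block being filled stay in memory; full blocks in between are written to temporary files. It assumes URLs contain no line breaks, since each block file stores one URL per line.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical class for illustration only.
class DiskBackedQueue {
    private final int blockSize;
    private final Path dir;
    private final Deque<String> head = new ArrayDeque<>();  // block currently being read
    private final List<String> tail = new ArrayList<>();    // block currently being filled
    private final Deque<Path> spilled = new ArrayDeque<>(); // full blocks on disk, oldest first
    private int nextBlockId = 0;

    DiskBackedQueue(int blockSize) {
        this.blockSize = blockSize;
        try {
            this.dir = Files.createTempDirectory("urlqueue");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Appends a URL; once the tail block is full it is written to a file. */
    void add(String url) {
        tail.add(url);
        if (tail.size() >= blockSize) {
            spillTail();
        }
    }

    /** Removes and returns the oldest URL, or null if the queue is empty. */
    String poll() {
        if (head.isEmpty()) {
            refillHead();
        }
        return head.poll();
    }

    int blocksOnDisk() {
        return spilled.size();
    }

    private void spillTail() {
        Path file = dir.resolve("block-" + (nextBlockId++) + ".txt");
        try {
            Files.write(file, tail, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        spilled.addLast(file);
        tail.clear();
    }

    private void refillHead() {
        if (!spilled.isEmpty()) {
            // The oldest block on disk holds the next URLs in FIFO order.
            Path file = spilled.pollFirst();
            try {
                head.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
                Files.delete(file);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        } else {
            // Nothing on disk: the partially filled tail block is next.
            head.addAll(tail);
            tail.clear();
        }
    }
}
```

Memory use is bounded by about two blocks regardless of queue length. The sketch leaves the temporary directory behind and does no buffering beyond whole-block reads and writes.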

> A database approach i think will be good for that.

No, I fear that's far too slow.
The problem is that both inserts and lookups have to be fast and
(since docs can point to any URL on the planet) both hit random
positions in the whole URL collection. That means you can't put parts of it
on disk and keep the rest in RAM without losing too much performance.
The solution I see is to keep the information for complete hosts on disk and
hold only a limited number of hosts in RAM at any one time. Then
even the decision whether a doc has already been crawled has to be queued
until the host info is loaded back into memory.
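
The host-wise idea could be sketched roughly as follows, in present-day Java. This is only an illustration under several assumptions: the class name HostPartitionedSeenSet, the one-file-per-host layout and the least-recently-used eviction are all invented here, host names are assumed to be usable as file names, and maxResidentHosts is assumed to be at least 1. A check against a host that is not in RAM returns null and is parked until load() is called for that host.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical class for illustration only.
class HostPartitionedSeenSet {
    private final int maxResidentHosts;
    private final Path dir;
    // Access-ordered map: iteration starts at the least recently used host.
    private final LinkedHashMap<String, Set<String>> resident =
            new LinkedHashMap<>(16, 0.75f, true);
    // Paths whose enqueue/ignore decision waits for their host to be loaded.
    private final Map<String, List<String>> pending = new HashMap<>();

    HostPartitionedSeenSet(int maxResidentHosts) {
        this.maxResidentHosts = maxResidentHosts;
        try {
            this.dir = Files.createTempDirectory("hosts");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** TRUE = new URL, FALSE = already seen, null = deferred until the host is loaded. */
    Boolean check(String host, String path) {
        Set<String> paths = resident.get(host);
        if (paths == null) {
            pending.computeIfAbsent(host, h -> new ArrayList<>()).add(path);
            return null;
        }
        return paths.add(path);
    }

    /** Brings a host into RAM and returns the deferred paths that turned out to be new. */
    List<String> load(String host) {
        try {
            if (!resident.containsKey(host)) {
                if (resident.size() >= maxResidentHosts) {
                    evictOldest();
                }
                Path file = fileFor(host);
                Set<String> paths = Files.exists(file)
                        ? new HashSet<>(Files.readAllLines(file))
                        : new HashSet<>();
                resident.put(host, paths);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> fresh = new ArrayList<>();
        List<String> waiting = pending.remove(host);
        if (waiting != null) {
            Set<String> paths = resident.get(host);
            for (String p : waiting) {
                if (paths.add(p)) {
                    fresh.add(p);
                }
            }
        }
        return fresh;
    }

    private void evictOldest() throws IOException {
        Iterator<Map.Entry<String, Set<String>>> it = resident.entrySet().iterator();
        Map.Entry<String, Set<String>> oldest = it.next();
        Files.write(fileFor(oldest.getKey()), oldest.getValue()); // one path per line
        it.remove();
    }

    private Path fileFor(String host) {
        return dir.resolve(host + ".txt");
    }
}
```

A scheduler on top of this would decide which host to load next, for example the one with the most deferred checks, so that each disk read resolves many decisions at once.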
Another thing I have in mind is to compress the URLs in memory. First,
a URL can be divided into several parts, some of which occur in a lot
of URLs (e.g. the host name). Second, URLs contain only a limited
number of different characters, so Huffman encoding is probably quite
efficient.
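
To make the Huffman part concrete, here is a small sketch in present-day Java. The class name UrlHuffman is made up, the code table is built from a non-empty sample of URLs, and the output is a string of '0' and '1' characters for readability; a real implementation would pack the bits into bytes. Splitting off shared parts such as the host name is not shown.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical class for illustration only.
class UrlHuffman {
    private static final class Node implements Comparable<Node> {
        final long weight;
        final char symbol; // only meaningful for leaves
        final Node left;
        final Node right;

        Node(long weight, char symbol, Node left, Node right) {
            this.weight = weight;
            this.symbol = symbol;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null;
        }

        @Override
        public int compareTo(Node other) {
            return Long.compare(weight, other.weight);
        }
    }

    private final Node root;
    private final Map<Character, String> codes = new HashMap<>();

    /** Builds the code table from character frequencies in a non-empty sample. */
    UrlHuffman(Iterable<String> sample) {
        Map<Character, Long> freq = new HashMap<>();
        for (String s : sample) {
            for (char c : s.toCharArray()) {
                freq.merge(c, 1L, Long::sum);
            }
        }
        PriorityQueue<Node> heap = new PriorityQueue<>();
        freq.forEach((c, n) -> heap.add(new Node(n, c, null, null)));
        // Repeatedly merge the two lightest subtrees until one tree remains.
        while (heap.size() > 1) {
            Node a = heap.poll();
            Node b = heap.poll();
            heap.add(new Node(a.weight + b.weight, '\0', a, b));
        }
        root = heap.poll();
        assign(root, "");
    }

    private void assign(Node n, String prefix) {
        if (n.isLeaf()) {
            // A single-symbol alphabet still needs a one-bit code.
            codes.put(n.symbol, prefix.isEmpty() ? "0" : prefix);
            return;
        }
        assign(n.left, prefix + '0');
        assign(n.right, prefix + '1');
    }

    /** Encodes a URL as a string of '0'/'1'; fails on characters absent from the sample. */
    String encode(String url) {
        StringBuilder bits = new StringBuilder();
        for (char c : url.toCharArray()) {
            String code = codes.get(c);
            if (code == null) {
                throw new IllegalArgumentException("character not in code table: " + c);
            }
            bits.append(code);
        }
        return bits.toString();
    }

    /** Walks the tree bit by bit and emits a symbol at every leaf. */
    String decode(String bits) {
        StringBuilder out = new StringBuilder();
        Node n = root;
        for (int i = 0; i < bits.length(); i++) {
            if (!n.isLeaf()) {
                n = bits.charAt(i) == '0' ? n.left : n.right;
            }
            if (n.isLeaf()) {
                out.append(n.symbol);
                n = root;
            }
        }
        return out.toString();
    }
}
```

Frequent characters such as '/', '.', 't' and 'w' get short codes, so the average comes out well under 8 bits per character. How much that saves on a real crawl would have to be measured.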


Clemens

