lucene-dev mailing list archives

From Doug Cutting <>
Subject Re: memory usage - RE: your crawler
Date Fri, 20 Sep 2002 18:51:41 GMT
Otis Gospodnetic wrote:
> Every URL extracted from a fetched document needs to be looked up in
> this VisitedURLsFilter.  If not there already, it needs to be added to
> it (and to the queue of URLs to fetch).  If there already, it is thrown
> away.
> Because of this, the data structure that VisitedURLsFilter uses to
> store and look up URLs must be super fast.
> This means that it cannot be on disk.
> However, a crawler normally encounters hundreds of thousands or even
> millions of URLs, so storing them all in this filter wouldn't work (RAM
> issue).
> So the question is how to store such a large number of URLs and at the
> same time provide fast lookup access to them.

One way to speed things up without storing all URLs in RAM would be to 
batch the filtering.  You start with a list of pages to crawl.  As each 
is downloaded, you extract its URLs and add them to a queue of URLs to 
be filtered.  Periodically process this queue.  If you sort the queue by 
URL, then merge it with a sorted offline data structure, like a B-Tree, 
you minimize the amount of I/O required: the merge is one sequential 
pass instead of one random disk access per URL.  This is not as fast as 
keeping things in RAM, but it is much faster than doing a B-Tree lookup 
as each URL is encountered.


