lucene-dev mailing list archives

From Otis Gospodnetic <>
Subject Re: memory usage - RE: your crawler
Date Fri, 20 Sep 2002 18:31:03 GMT
Actually, Clemens already has something called CachingQueue, which does
a good job of storing most of the 'URLs to fetch' in a queue stored on
disk.

A little 'background':

But there is also a need for a data structure that contains all URLs
that the crawler has already 'seen' (found and extracted from fetched
documents).  There is no need to store the same URL twice, and fetch
it twice, so there is this 'VisitedURLsFilter' (don't confuse 'visited'
with 'fetched'; it just means 'already seen') which contains a list of
all seen URLs.
Every URL extracted from a fetched document needs to be looked up in
this VisitedURLsFilter.  If it is not there already, it needs to be
added to it (and to the queue of URLs to fetch).  If it is there
already, it is thrown away.

Because of this, the data structure that the VisitedURLsFilter uses to
store and look up URLs must be super fast.
This means that it cannot be on disk.
However, a crawler normally encounters hundreds of thousands or even
millions of URLs, so storing them all in this filter wouldn't work (RAM
would run out).

So the question is how to store such a large number of URLs and at the
same time provide fast lookup access to them.
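To make the trade-off concrete, here is a minimal sketch (not from LARM; the class and method names are hypothetical) of one common way to shrink such a filter: store a fixed-size 64-bit hash per URL instead of the full string, accepting a tiny false-positive risk in exchange for constant, small per-URL memory.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: a visited-URLs filter that stores 64-bit hashes
// of URLs instead of the full strings. This trades a negligible
// false-positive risk (two URLs hashing to the same value) for a large
// memory saving over keeping every URL string in a HashMap.
public class HashedVisitedFilter {
    private final Set<Long> seen = new HashSet<>();

    // Simple 64-bit FNV-1a hash; any well-distributed hash would do.
    private static long hash64(String url) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < url.length(); i++) {
            h ^= url.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // Returns true if the URL was new (and is now recorded).
    public boolean addIfUnseen(String url) {
        return seen.add(hash64(url));
    }

    public int size() {
        return seen.size();
    }
}
```

This only reduces the constant factor, of course; the set still grows linearly with the number of seen URLs, so it does not replace the disk-based ideas discussed in this thread.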


--- "Spencer, Dave" <> wrote:
> I may have misunderstood something, but
> if you're looking to reduce memory/RAM usage by
> a convenient data structure in LARM, you might
> consider jdbm, a disk-based BTree.
> I wrote a Map interface to it called PersistentMap, so you
> can program to it in a convenient form and possibly just drop it in
> as a replacement for any existing Maps (HashMap, TreeMap).
> It's not hard to do a persistent Set or List either.
> The main alternative to jdbm is JISP, which I haven't used.
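Dave's point about programming to the Map interface could look roughly like this (a hypothetical sketch; his PersistentMap itself is not shown): if the visited filter depends only on java.util.Map, an in-memory HashMap can later be swapped for a disk-backed implementation without touching the filter logic.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a visited-URLs filter written against the Map
// interface only. An in-memory HashMap is used below; a disk-backed
// Map (such as a PersistentMap over a BTree) could be passed in
// instead without changing this class.
public class VisitedURLsFilter {
    private final Map<String, Boolean> visited;

    public VisitedURLsFilter(Map<String, Boolean> backingMap) {
        this.visited = backingMap;
    }

    // Returns true if the URL had not been seen before (and records it).
    public boolean markVisited(String url) {
        return visited.put(url, Boolean.TRUE) == null;
    }
}
```

Usage would be, e.g., `new VisitedURLsFilter(new HashMap<>())` today and a persistent Map later, which is exactly the drop-in property being suggested.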
> -----Original Message-----
> From: Clemens Marschner []
> Sent: Friday, September 20, 2002 6:58 AM
> To: Halácsy Péter
> Cc: Lucene Developers List
> Subject: Re: your crawler
> >----- Original Message -----
> >From: Halácsy Péter
> >To:
> >Sent: Friday, September 20, 2002 12:10 PM
> >Subject: your crawler
> >
> >
> >BTW, what is the status of the LARM crawler? Two months ago I promised I
> >could help from September, because I would be a PhD student at Budapest
> >University of Technology. Did you choose Avalon as a component framework?
> I'm in the last days of my master's thesis. I will get back to the
> crawler after Oct. 2nd (and a week of vacation on Garda's beautiful
> lakeside).
> Otis has played around with the crawler in the last two weeks, and we had
> long email conversations. We have found some problems one has to cope
> with. For instance, LARM has a relatively high memory overhead per server
> (I mentioned it was made for large intranets). Otis's 100 MB of RAM
> overflowed after crawling about 40,000 URLs in the .hr domain.
> I myself have crawled 500,000 files from 500 servers with about 400 MB of
> main memory (by the way, that only takes about 2-3 hours [but imposes
> some load on the servers...]).
> We have talked about how the more or less linearly rising memory
> consumption could be controlled. Two components use up memory: the
> URLVisitedFilter, which at this time simply holds a HashMap of already
> visited URLs, and the FetcherTaskQueue, which holds a CachingQueue with
> crawling tasks for each server. The CachingQueue itself holds up to two
> blocks of the queue in RAM, so this may rise fast if the number of
> servers rises (look at the Javadoc; I recall it's well documented).
> We thought about controlling this by a) compressing the visitedFilter's
> contents, b) taking advantage of some locality property of URL
> distributions (making it possible to move some of the URLs to secondary
> storage), and c) binding a server to only one thread, minimizing the need
> for synchronization (and providing more possibilities to move the tasks
> out of RAM). a) can be accomplished by compressing the sorted list of
> URLs (there are papers about that on CiteSeer). Incoming URLs would have
> to be divided into blocks (i.e. per server) and, when a time/space
> threshold is reached, the block is compressed. I have done a little work
> on that already, although my implementation only works in batch mode, not
> incrementally.
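One concrete technique from that sorted-list compression literature is front coding; whether LARM's batch implementation uses exactly this is not stated, so the sketch below is illustrative only. Each URL in a sorted block stores just the length of the prefix it shares with its predecessor, plus the remaining suffix; sorted URLs from one server share long prefixes, so the suffixes are short.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of front coding: a sorted block of URLs is stored
// as (shared-prefix length, suffix) pairs. Decoding walks the block in
// order, rebuilding each URL from the previous one.
public class FrontCoder {

    // Encodes a sorted list; each entry is "prefixLen\tsuffix".
    public static List<String> encode(List<String> sortedUrls) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String url : sortedUrls) {
            int p = 0;
            int max = Math.min(prev.length(), url.length());
            while (p < max && prev.charAt(p) == url.charAt(p)) p++;
            out.add(p + "\t" + url.substring(p));
            prev = url;
        }
        return out;
    }

    public static List<String> decode(List<String> encoded) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String entry : encoded) {
            int tab = entry.indexOf('\t');
            int p = Integer.parseInt(entry.substring(0, tab));
            String url = prev.substring(0, p) + entry.substring(tab + 1);
            out.add(url);
            prev = url;
        }
        return out;
    }
}
```

Note that this form only supports sequential decoding of a block, not random lookup, which fits the batch-mode limitation mentioned above: a compressed block has to be scanned (or re-sorted and re-encoded) when new URLs arrive.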
> Finally, the LuceneStorage is far from optimized and is a major
> bottleneck. We thought about separating the crawling process from the
> indexing process.
> By the way: has anybody used a profiler on the Lucene indexing part? I
> suppose there is still a lot to optimize there.
> Regarding Avalon: I haven't had the time to look at it thoroughly.
> Mehran Mehr wanted to do that, but I haven't heard anything from him for
> weeks now. Probably he wants to present us with the perfect solution
> very soon...
> What I have done is try to use the Jakarta BeanUtils for loading the
> config files. It works pretty simply (just a few lines of code, very
> straightforward), but then the check for mandatory parameters etc. would
> have to be done by hand afterwards, something I would expect an XML
> reader to get from an XSD file or something, at least optionally.
> Back to my 15-hour day... :-|
> --Clemens


