lucene-dev mailing list archives

From "Spencer, Dave" <>
Subject memory usage - RE: your crawler
Date Fri, 20 Sep 2002 18:04:53 GMT
I may have misunderstood something, but
if you're looking to reduce the memory/RAM usage of
a convenient data structure in LARM, you might
consider jdbm, a disk-based storage engine.

I wrote a Map interface to it called PersistentMap so you
can program to it in a convenient form and possibly just drop it in
as a replacement for any existing Maps (HashMap, TreeMap).
It's not hard to do a persistent Set or List either.
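PersistentMap itself isn't shown in this mail; as a rough illustration of the idea (a hypothetical class, not the actual PersistentMap or jdbm API), here is a map-like wrapper that persists its contents to disk so callers keep programming against plain get/put:

```java
import java.io.*;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, NOT the actual PersistentMap/jdbm API: a map
// whose contents survive restarts by serializing to a file. Real jdbm
// pages individual records to disk instead of loading everything.
class FileBackedMap {

    private final File file;
    private final Map<String, String> entries;

    @SuppressWarnings("unchecked")
    FileBackedMap(File file) {
        this.file = file;
        Map<String, String> loaded = new HashMap<>();
        if (file.exists() && file.length() > 0) {
            try (ObjectInputStream in =
                     new ObjectInputStream(new FileInputStream(file))) {
                loaded = (Map<String, String>) in.readObject();
            } catch (IOException | ClassNotFoundException e) {
                throw new RuntimeException(e);
            }
        }
        this.entries = loaded;
    }

    String get(String key) { return entries.get(key); }

    String put(String key, String value) {
        String old = entries.put(key, value);
        flush(); // write-through: every put is persisted immediately
        return old;
    }

    private void flush() {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(new HashMap<>(entries));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A real drop-in replacement would implement java.util.Map (e.g. by extending AbstractMap), which is what makes swapping it in for a HashMap painless.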

The main alternative to jdbm is JISP, which I haven't used.

-----Original Message-----
From: Clemens Marschner []
Sent: Friday, September 20, 2002 6:58 AM
To: Halácsy Péter
Cc: Lucene Developers List
Subject: Re: your crawler

>----- Original Message -----
>From: Halácsy Péter
>Sent: Friday, September 20, 2002 12:10 PM
>Subject: your crawler
>BTW what is the status of the LARM crawler? 2 months ago I promised to
>help from September because I would be a PhD student at the Budapest
>University of Technology. Did you choose Avalon as a component framework?

I'm in the last days of my master's thesis. I will get back to the crawler
after Oct. 2nd (and a week of vacation on Lake Garda's beautiful lakeside).

Otis has played around with the crawler in the last two weeks, and we have
had long email conversations. We have found some problems one has to cope
with. For example, LARM has a relatively high memory overhead per server (I
mentioned it was made for large intranets). Otis's 100 MB of RAM overflowed
after crawling about 40,000 URLs in the .hr domain.
I for myself have crawled 500,000 files from 500 servers with about 400 MB
of main memory (by the way, that only takes about 2-3 hours [but imposes
some load on the servers...])

We have talked about how the more or less linearly rising memory consumption
could be controlled. Two components use up memory: the URLVisitedFilter,
which at this time simply holds a HashMap of already visited URLs, and the
FetcherTaskQueue, which holds a CachingQueue with crawling tasks for each
server. The CachingQueue itself holds up to two blocks of the queue in RAM,
so this may rise fast if the number of servers rises (look at the Javadoc; I
recall it's well documented).
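As a rough sketch of that block-spilling idea (illustrative names and details, not LARM's actual CachingQueue): elements are grouped into fixed-size blocks, full middle blocks are serialized to temp files, and only the head and tail blocks occupy RAM:

```java
import java.io.*;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative sketch, not LARM's actual implementation: a FIFO queue
// that keeps only its head block (being consumed) and tail block (being
// filled) in RAM; full middle blocks are spilled to temp files.
class BlockCachingQueue {

    private final int blockSize;
    private ArrayDeque<String> head = new ArrayDeque<>();
    private ArrayDeque<String> tail = new ArrayDeque<>();
    private final Deque<File> spilled = new ArrayDeque<>();

    BlockCachingQueue(int blockSize) { this.blockSize = blockSize; }

    void add(String item) {
        tail.addLast(item);
        if (tail.size() == blockSize) {        // tail block full: spill it
            spilled.addLast(writeBlock(tail));
            tail = new ArrayDeque<>();
        }
    }

    /** Returns the oldest element, or null if the queue is empty. */
    String poll() {
        if (head.isEmpty()) {
            if (!spilled.isEmpty()) {
                head = readBlock(spilled.pollFirst()); // page next block in
            } else {
                ArrayDeque<String> t = tail;           // no disk blocks left:
                tail = head;                           // consume the tail
                head = t;
            }
        }
        return head.pollFirst();
    }

    private File writeBlock(ArrayDeque<String> block) {
        try {
            File f = File.createTempFile("queue-block", ".ser");
            f.deleteOnExit();
            try (ObjectOutputStream out =
                     new ObjectOutputStream(new FileOutputStream(f))) {
                out.writeObject(new ArrayList<>(block));
            }
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @SuppressWarnings("unchecked")
    private ArrayDeque<String> readBlock(File f) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream(f))) {
            ArrayDeque<String> block =
                new ArrayDeque<>((List<String>) in.readObject());
            f.delete();
            return block;
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}
```

With one such queue per server, the per-server RAM cost is bounded by roughly two blocks rather than the whole task list, which is exactly why memory still rises with the number of servers.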

We thought about controlling this by a) compressing the visitedFilter's
contents, b) taking advantage of some locality property of URLs
(making it possible to move some of the URLs to secondary storage), and
c) binding a server to only one thread, minimizing the need for
synchronization (and providing more possibilities to move the tasks out of
RAM). a) could be accomplished by compressing the sorted list of URLs (there
are papers about that on CiteSeer). Incoming URLs would have to be divided
into blocks (i.e. per server) and, when a time/space threshold is reached,
the block compressed. I have done a little work on that already, although my
implementation only works in batch mode, not incrementally.
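For illustration, a) can be sketched with simple front-coding of a sorted block (a hypothetical helper, not the batch implementation mentioned above): since sorted URLs share long prefixes, each entry stores only the length of the prefix shared with its predecessor plus the differing suffix:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: front-coding for a sorted block of URLs. Each
// encoded entry is "<shared-prefix-length>|<suffix>"; the first entry
// shares nothing with an empty predecessor, so it carries the full URL.
class FrontCoder {

    static List<String> encode(List<String> sortedUrls) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String url : sortedUrls) {
            int p = commonPrefixLength(prev, url);
            out.add(p + "|" + url.substring(p));
            prev = url;
        }
        return out;
    }

    static List<String> decode(List<String> encoded) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String entry : encoded) {
            int bar = entry.indexOf('|');
            int p = Integer.parseInt(entry.substring(0, bar));
            String url = prev.substring(0, p) + entry.substring(bar + 1);
            out.add(url);
            prev = url;
        }
        return out;
    }

    private static int commonPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }
}
```

Because the scheme only exploits adjacency in sorted order, it naturally works block-by-block, which matches the per-server blocks and the batch-mode limitation described above.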

Finally, the LuceneStorage is far from being optimized and is a major
bottleneck. We thought about separating the crawling from the indexing
process.

btw: Has anybody used a profiler with the Lucene indexing part? I suspect
there is still a lot to optimize there.

Regarding Avalon: I haven't had the time to look at it thoroughly. Mehr
wanted to do that, but I haven't heard anything from him for weeks now.
Probably he wants to present us the perfect solution very soon...

What I have done is I tried to use the Jakarta BeanUtils for loading the
config files. It works pretty simply (just a few lines of code, very
straightforward), but then the check for mandatory parameters etc. would
have to be done by hand afterwards, something I would expect an XML reader
to do from an xsd file or something, at least optionally.
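That by-hand check might look like this minimal sketch (hypothetical parameter names, and plain java.util.Properties rather than BeanUtils, to keep it self-contained):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Sketch of a hand-rolled mandatory-parameter check, the kind of
// validation an XSD-aware config reader could perform automatically.
class ConfigValidator {

    /** Returns the names of all mandatory keys that are absent or blank. */
    static List<String> missingKeys(Properties config, String... mandatory) {
        List<String> missing = new ArrayList<>();
        for (String key : mandatory) {
            String value = config.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }
}
```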

Back to my 15 hour day... :-|


To unsubscribe, e-mail:
For additional commands, e-mail:

