incubator-droids-dev mailing list archives

From: Chapuis Bertil
Subject: Re: Queue: in memory or on disk?
Date: Sat, 14 Nov 2009 12:06:35 GMT
Personally, I used Droids to crawl a website of approximately 250,000 pages. The queue was stored
in memory, and I arbitrarily allocated 1 GB of memory to Java. Everything worked fine.

That's not a large number of web pages, but I think Droids' current implementation is well suited
to such jobs: crawling a relatively small set of web pages or crawling an intranet. This is
particularly true if you need to customize how the pages are handled.

I hope this experience helps.

Bertil Chapuis

On Nov 14, 2009, at 3:59 AM, Otis Gospodnetic wrote:

> OK, thanks.
> So how do people really use Droids at scale, e.g. for crawling a large number of web pages?
> I happen to use it for something smallish, so I never had issues with the queue being in the
> JVM heap and getting OOMs because of that. But I imagine that anyone using it for a larger
> crawl would hit OOM sooner or later, no?
> Does this imply that either nobody is using Droids for large-scale crawls, or that everyone
> who does has implemented their own custom disk-backed queue?
> Thanks,
> Otis
> --
> Sematext is hiring --
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
> ----- Original Message ----
>> From: Ryan McKinley
>> Sent: Fri, November 13, 2009 5:17:51 PM
>> Subject: Re: Queue: in memory or on disk?
>> ya, the standard one is in memory.
>> It is easy to write one that stores things to disk or whatever -- I use one that
>> stores tasks in an H2 database, but it is not general enough to contribute
>> back...
>> I think Migfa was looking at replacing the Droids Queue interface with the
>> standard java.util.Queue interface.
>> ryan
>> On Nov 13, 2009, at 5:10 PM, Chapuis Bertil wrote:
>>> I think the current implementation only provides in-memory task queues.
>>> However, since the TaskQueue interface is relatively simple, it shouldn't be too
>>> hard to persist the data to disk or to implement a TaskQueue that works
>>> with a JMS broker or something else.
>>> On Nov 12, 2009, at 10:37 PM, Otis Gospodnetic wrote:
>>>> Hello,
>>>> I haven't looked at the sources. But who stores items put in the Queue? Are
>>>> they in memory, or does something write them to disk, or something else?
>>>> Thanks,
>>>> Otis
>>>> --
>>>> Sematext is hiring --
>>>> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
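
The disk-backed queue discussed in this thread could look roughly like the sketch below. It is only an illustration of the idea, not code from Droids: the class name FileBackedUriQueue and its offer/poll/size methods are hypothetical, it does not implement the Droids TaskQueue interface or java.util.Queue, and it assumes each queued item can be represented as a URI string. Entries are appended to a file and read back in FIFO order, so the pending items do not sit on the JVM heap.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

/**
 * Hypothetical sketch: a FIFO queue of URIs kept in a file rather than on the heap.
 * Not part of Droids; it does not implement the Droids TaskQueue interface.
 */
class FileBackedUriQueue implements AutoCloseable {
    private final RandomAccessFile file;
    private long readPos;   // file offset of the next entry to hand out
    private long writePos;  // file offset where the next entry is appended
    private long size;      // entries appended but not yet polled

    FileBackedUriQueue(File path) throws IOException {
        file = new RandomAccessFile(path, "rw");
        file.setLength(0); // start empty; a real queue would recover its state here
    }

    /** Appends a URI to the tail of the queue. */
    synchronized void offer(String uri) throws IOException {
        file.seek(writePos);
        file.writeUTF(uri); // length-prefixed; limited to 65535 encoded bytes
        writePos = file.getFilePointer();
        size++;
    }

    /** Removes and returns the head of the queue, or null if the queue is empty. */
    synchronized String poll() throws IOException {
        if (size == 0) {
            return null;
        }
        file.seek(readPos);
        String uri = file.readUTF();
        readPos = file.getFilePointer();
        size--;
        return uri;
    }

    /** Number of entries waiting to be polled. */
    synchronized long size() {
        return size;
    }

    @Override
    public synchronized void close() throws IOException {
        file.close();
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("uri-queue", ".bin");
        tmp.deleteOnExit();
        try (FileBackedUriQueue queue = new FileBackedUriQueue(tmp)) {
            queue.offer("http://example.org/");
            queue.offer("http://example.org/about");
            System.out.println(queue.poll());
            System.out.println(queue.size());
        }
    }
}
```

The sketch trades speed for simplicity: every call does unbuffered file I/O, the file is never compacted, and nothing is recovered after a restart. A database-backed queue such as the H2 one mentioned above, or a JMS broker, would be the more robust choice for a large crawl.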
