incubator-droids-dev mailing list archives

From Chapuis Bertil <bchap...@agimem.com>
Subject Re: Queue: in memory or on disk?
Date Sat, 14 Nov 2009 12:06:35 GMT
Personally, I used Droids to crawl a website of approximately 250,000 pages. The queue was stored
in memory and I arbitrarily allocated 1 GB of memory to Java. Everything worked fine.

That's not a large number of web pages, but I think Droids' current implementation is well suited
to such jobs: crawling a relatively small set of web pages or crawling an intranet. This is
particularly true if you need to customize how the pages are handled.

I hope this experience helps.
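
For anyone who does outgrow the heap, here is a minimal sketch of what a disk-backed FIFO queue could look like. It is only an illustration under stated assumptions: it does not implement the actual Droids TaskQueue interface, the class name DiskBackedUrlQueue and its methods (createTemp, offer, poll, size) are hypothetical, and it stores plain URL strings rather than Droids tasks. Entries are appended to a file and read back in order, so only two file offsets and a counter stay in memory.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical FIFO queue of URL strings that keeps its entries in a file, not on the heap. */
class DiskBackedUrlQueue implements AutoCloseable {
    private final RandomAccessFile file;
    private long readPos = 0;   // file offset of the next entry to poll
    private long writePos = 0;  // file offset where the next entry is appended
    private long size = 0;      // number of entries not yet polled

    DiskBackedUrlQueue(Path path) {
        RandomAccessFile f;
        try {
            f = new RandomAccessFile(path.toFile(), "rw");
            f.setLength(0); // start empty; this sketch does not resume an earlier crawl
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        this.file = f;
    }

    /** Convenience factory that backs the queue with a temporary file. */
    static DiskBackedUrlQueue createTemp() {
        try {
            Path p = Files.createTempFile("crawl-queue", ".bin");
            p.toFile().deleteOnExit();
            return new DiskBackedUrlQueue(p);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Appends a URL at the end of the file. writeUTF limits each entry to 65535 encoded bytes. */
    synchronized void offer(String url) {
        try {
            file.seek(writePos);
            file.writeUTF(url);
            writePos = file.getFilePointer();
            size++;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the oldest URL not yet polled, or null if the queue is empty. */
    synchronized String poll() {
        if (size == 0) {
            return null;
        }
        try {
            file.seek(readPos);
            String url = file.readUTF();
            readPos = file.getFilePointer();
            size--;
            return url;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    synchronized long size() {
        return size;
    }

    @Override
    public synchronized void close() {
        try {
            file.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The file only grows in this sketch; a real crawl would also want to compact or rotate it, deduplicate URLs, and survive restarts, which is where a database or a JMS broker (both mentioned in this thread) becomes attractive.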

Bertil Chapuis


On Nov 14, 2009, at 3:59 AM, Otis Gospodnetic wrote:

> OK, thanks.
> 
> So how do people really use Droids at scale, e.g. crawling a large number of web pages?
> I happen to use it for something smallish, so I never had issues with the queue being in the
> JVM heap and getting OOMs because of that.  But I imagine that anyone using it for a larger
> crawl would hit OOM sooner or later, no?
> 
> Does this imply that either nobody is using Droids for large-scale crawls, or that everyone
> who does implemented their own, custom disk-backed queue?
> 
> 
> Thanks,
> Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
> 
> 
> 
> ----- Original Message ----
>> From: Ryan McKinley <ryantxu@gmail.com>
>> To: droids-dev@incubator.apache.org
>> Sent: Fri, November 13, 2009 5:17:51 PM
>> Subject: Re: Queue: in memory or on disk?
>> 
>> ya, the standard one is in memory.
>> 
>> It is easy to write one to store things to disk or whatever -- I use one that 
>> stores tasks to an h2 database, but it is not general enough to contribute 
>> back...
>> 
>> I think Migfa was looking at replacing the droids Queue interface with a 
>> standard java.util.Queue interface
>> 
>> ryan
>> 
>> 
>> On Nov 13, 2009, at 5:10 PM, Chapuis Bertil wrote:
>> 
>>> I think the current implementation only provides in-memory queues of tasks.
>>> However, since the TaskQueue interface is relatively simple, it shouldn't be too
>>> hard to persist the data on disk or to implement a TaskQueue which works
>>> with a JMS broker or something else.
>>> 
>>> 
>>> On Nov 12, 2009, at 10:37 PM, Otis Gospodnetic wrote:
>>> 
>>>> Hello,
>>>> 
>>>> I haven't looked at the sources.  But who stores items put in the Queue?
>>>> Are they in memory, or does something write them to disk, or something else?
>>>> 
>>>> Thanks,
>>>> Otis
>>>> 
>>> 
> 

