manifoldcf-user mailing list archives

From Paul Boichat <paul.boic...@exonar.com>
Subject Re: Apache ManifoldCF Performance
Date Wed, 10 Sep 2014 17:01:25 GMT
Hi Karl,

We're beginning to see issues once the document count exceeds 10 million. At
that point, even with good PostgreSQL vacuuming, the jobqueue table starts to
become a bottleneck.

For example, select count(*) from jobqueue, which is executed when querying
job status, will do a full table scan of jobqueue, which has more than 10
million rows. That's going to take some time in PostgreSQL.
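
To illustrate one way a status query could avoid counting the whole queue, here is a minimal sketch that keeps per-job, per-status counters in a side table maintained by triggers. The schema, the jobqueue_counts table, the trigger names and the helper functions are all hypothetical and much simpler than ManifoldCF's real jobqueue. It uses Python's sqlite3 only so that it runs standalone; a PostgreSQL version would need its own trigger syntax.

```python
import sqlite3

# Hypothetical, simplified schema: NOT ManifoldCF's real jobqueue definition.
SCHEMA = """
CREATE TABLE jobqueue (
    id     INTEGER PRIMARY KEY,
    jobid  INTEGER NOT NULL,
    status TEXT    NOT NULL
);
-- Side table (assumption): one row per (jobid, status) holding a running count.
CREATE TABLE jobqueue_counts (
    jobid  INTEGER NOT NULL,
    status TEXT    NOT NULL,
    n      INTEGER NOT NULL,
    PRIMARY KEY (jobid, status)
);
CREATE TRIGGER jobqueue_ins AFTER INSERT ON jobqueue BEGIN
    INSERT OR IGNORE INTO jobqueue_counts (jobid, status, n)
        VALUES (NEW.jobid, NEW.status, 0);
    UPDATE jobqueue_counts SET n = n + 1
     WHERE jobid = NEW.jobid AND status = NEW.status;
END;
CREATE TRIGGER jobqueue_del AFTER DELETE ON jobqueue BEGIN
    UPDATE jobqueue_counts SET n = n - 1
     WHERE jobid = OLD.jobid AND status = OLD.status;
END;
CREATE TRIGGER jobqueue_upd AFTER UPDATE OF status ON jobqueue BEGIN
    UPDATE jobqueue_counts SET n = n - 1
     WHERE jobid = OLD.jobid AND status = OLD.status;
    INSERT OR IGNORE INTO jobqueue_counts (jobid, status, n)
        VALUES (NEW.jobid, NEW.status, 0);
    UPDATE jobqueue_counts SET n = n + 1
     WHERE jobid = NEW.jobid AND status = NEW.status;
END;
"""


def open_queue():
    """Create an in-memory database with the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def job_status_counts(conn, jobid):
    """Read the counters for one job; touches a few rows, not the queue."""
    rows = conn.execute(
        "SELECT status, n FROM jobqueue_counts WHERE jobid = ? AND n > 0",
        (jobid,),
    )
    return dict(rows.fetchall())


def job_status_counts_by_scan(conn, jobid):
    """Reference answer computed by counting jobqueue rows directly."""
    rows = conn.execute(
        "SELECT status, count(*) FROM jobqueue WHERE jobid = ? GROUP BY status",
        (jobid,),
    )
    return dict(rows.fetchall())
```

The trade-off is that every insert, delete and status change now also writes to the counter table, so under heavy concurrent load those counter rows could become a hot spot of their own.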

SSDs will certainly make a big difference to document processing
throughput (which we see is largely I/O bound in PostgreSQL), but we are
increasingly seeing long-running queries in the logs. Our current thinking
is that we'll need to refactor JobQueue somewhat to optimise queries and
potentially partition jobqueue into a set of smaller tables (a table per
queue, for example).
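
As a rough sketch of the table-per-queue idea, assuming one table per job: the table naming scheme, the columns and the function names below are hypothetical, not an existing ManifoldCF design. It again uses Python's sqlite3 so that it runs standalone.

```python
import sqlite3


def queue_table(jobid):
    """Name of the per-job table (hypothetical naming scheme).

    Table names cannot be bound as SQL parameters, so the name is built
    only from a validated non-negative integer.
    """
    jobid = int(jobid)
    if jobid < 0:
        raise ValueError("jobid must be non-negative")
    return f"jobqueue_{jobid}"


def ensure_queue(conn, jobid):
    """Create the job's own queue table if it does not exist yet."""
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {queue_table(jobid)} ("
        "id INTEGER PRIMARY KEY, "
        "dochash TEXT NOT NULL, "
        "status TEXT NOT NULL)"
    )


def enqueue(conn, jobid, dochash, status="pending"):
    """Insert a document into the table belonging to its job."""
    ensure_queue(conn, jobid)
    conn.execute(
        f"INSERT INTO {queue_table(jobid)} (dochash, status) VALUES (?, ?)",
        (dochash, status),
    )


def queue_size(conn, jobid):
    """Count one job's documents; only that job's table is scanned."""
    ensure_queue(conn, jobid)
    row = conn.execute(f"SELECT count(*) FROM {queue_table(jobid)}").fetchone()
    return row[0]
```

A count for one job then only reads that job's table, at the cost of queries that span jobs having to visit every table.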

Paul



VP Engineering,
Exonar Ltd

T: +44 7940 567724

twitter:@exonarco @pboichat
W: http://www.exonar.com

On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Baptiste,
>
> ManifoldCF is not limited by the number of agents processes or parallel
> connectors.  Overall database performance is the limiting factor.
>
> I would read this:
>
> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>
> Also, there's a section in ManifoldCF (I believe Chapter 2) that discusses
> this issue.
>
> Some five years ago, I successfully crawled 5 million web documents, using
> PostgreSQL 8.3.  PostgreSQL 9.x is faster, and with modern SSDs, I expect
> that you will do even better.  In general, I'd say it was fine to shoot for
> 10M - 100M documents on ManifoldCF, provided that you use a good database,
> and provided that you maintain it properly.
>
> Thanks,
> Karl
>
>
>
>
>
> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste Berthier <ba.berthier@gmail.com> wrote:
>
>> Hi
>>
>> I would like to know the maximum number of documents that you have
>> managed to crawl with ManifoldCF, and how many connectors it can run
>> in parallel.
>>
>> Thanks for your answer
>>
>> Baptiste
>>
>
>
