manifoldcf-commits mailing list archives

Subject [CONF] Apache Connectors Framework > FAQ
Date Tue, 12 Oct 2010 21:32:00 GMT
Space: Apache Connectors Framework
Page: FAQ

Comment added by Karl Wright:

The sample I used was some 30,000 documents.

Several effects come into play for larger, more extended crawls.  First, PostgreSQL
accumulates "dead tuples" over time, which degrade performance.  There is a procedure
for cleaning these up, which I believe is documented on the "Build and Deploy" page,
involving a VACUUM FULL operation.
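
One way to see whether dead tuples have piled up is to read PostgreSQL's pg_stat_user_tables statistics view, which exposes n_live_tup and n_dead_tup per table. The following is only a minimal sketch, not part of ManifoldCF: the helper name dead_tuple_report and the 0.2 threshold are hypothetical, and it assumes you pass in a DB-API connection to the crawl database (for example one opened with psycopg2).

```python
# Hypothetical helper, not part of ManifoldCF.
# pg_stat_user_tables is PostgreSQL's per-table statistics view.
DEAD_TUPLE_QUERY = (
    "SELECT relname, n_live_tup, n_dead_tup "
    "FROM pg_stat_user_tables "
    "ORDER BY n_dead_tup DESC"
)


def dead_tuple_report(conn, min_ratio=0.2):
    """Return (table, live, dead, ratio) for tables at or above min_ratio.

    conn is any DB-API connection to the PostgreSQL database.
    min_ratio is an arbitrary illustrative threshold, not a recommendation.
    """
    cur = conn.cursor()
    try:
        cur.execute(DEAD_TUPLE_QUERY)
        rows = cur.fetchall()
    finally:
        cur.close()

    report = []
    for name, live, dead in rows:
        total = live + dead
        # Guard against empty tables to avoid dividing by zero.
        ratio = dead / total if total else 0.0
        if ratio >= min_ratio:
            report.append((name, live, dead, ratio))
    return report
```

Tables that show up in the report would be the likely candidates for the cleanup procedure described above.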

Second, if you use PostgreSQL's configuration out of the box, a background VACUUM operation
is likely to start at some point during your crawl.  This background-process vacuum
cannot keep up with dead tuple accumulation and only serves to slow things down.
So turn "autovacuum" to OFF.  This is also mentioned on the "Build and Deploy" page.

Third, ManifoldCF itself periodically asks PostgreSQL to reindex data, which can affect
overall performance.  It does this once every 100,000 inserts/modifies to the queue.
That is obviously more than the size of the crawl I ran.
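
As a rough illustration of why the reindexing never showed up in my sample but may in a larger crawl, here is a back-of-the-envelope sketch.  The function name and the writes_per_document parameter are hypothetical; the real number of queue inserts/modifies per document depends on the crawl and is not something stated here.

```python
# Queue inserts/modifies between reindexes, as stated above.
REINDEX_INTERVAL = 100_000


def expected_reindexes(documents, writes_per_document=1):
    """Estimate how many reindexes a crawl of the given size triggers.

    writes_per_document is an assumed figure for illustration only;
    a real crawl may touch each queue row more than once.
    """
    return (documents * writes_per_document) // REINDEX_INTERVAL


print(expected_reindexes(30_000))   # 0
print(expected_reindexes(280_000))  # 2
```

With one queue write per document, a 30,000-document crawl never reaches the threshold, while a 280,000-document crawl crosses it twice.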

Hope this answers your question.

In reply to a comment by Farzad:
How big was the file share you crawled?  I have 280,000 files spread across a lot of directories.
It starts out at 29-31 docs/sec, but as it crawls it gets slower.  For example, at 98,000
it was doing 31 docs a second; at 203,000 it is doing 16 docs a second.  So I'm just curious
how long you have been able to sustain the ~30 docs/sec.
