manifoldcf-user mailing list archives

From Karl Wright <>
Subject Re: Disk usage for big crawl
Date Mon, 25 Jul 2011 13:16:26 GMT
Hi Erlend,

I can't answer for how PostgreSQL allocates space on the whole - the
PostgreSQL documentation may tell you more.  I can say this much:

(1) PostgreSQL keeps "dead tuples" around until they are "vacuumed".
This implies that the table space grows until the vacuuming operation
takes place.
(2) At MetaCarta, we found that PostgreSQL's normal autovacuuming
process (which runs in the background) was insufficient to keep up with
ManifoldCF going at full tilt in a web crawl.
(3) The solution at MetaCarta was to periodically run "maintenance",
which involves running a VACUUM FULL operation on the database.  This
will cause the crawl to stall while the vacuum operation is going,
since a new (compact) disk image of every table must be made, and thus
each table is locked for a period of time.
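
To see whether dead tuples are piling up between vacuums, one option is to
look at PostgreSQL's pg_stat_user_tables statistics view, which exposes
n_live_tup and n_dead_tup per table.  The following is only a minimal
sketch: the helper names, the 0.2 threshold, and the sample rows are
assumptions made for illustration, not anything ManifoldCF ships.

```python
# Query against PostgreSQL's pg_stat_user_tables statistics view.
# Run it with whatever client you already use; it is only a string here.
DEAD_TUPLE_QUERY = """
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
"""


def dead_fraction(n_live, n_dead):
    """Share of a table's tuples that are dead (0.0 for an empty table)."""
    total = n_live + n_dead
    return n_dead / total if total else 0.0


def tables_needing_vacuum(rows, threshold=0.2):
    """Return table names whose dead-tuple share is at or above threshold.

    rows: iterable of (relname, n_live_tup, n_dead_tup) tuples, i.e. the
    shape returned by DEAD_TUPLE_QUERY.  The threshold is an arbitrary,
    hypothetical cut-off; tune it to your own crawl.
    """
    return [name for name, live, dead in rows
            if dead_fraction(live, dead) >= threshold]


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    sample = [("jobqueue", 100, 900), ("ingeststatus", 1000, 10)]
    print(tables_needing_vacuum(sample))
```

If the list returned stays long even though autovacuum is enabled, that
suggests autovacuum is not keeping up and periodic maintenance is needed.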

So my suggestion is to adopt a maintenance strategy first, make sure
it is working for you, and then calculate how much disk space you will
need for that strategy.  Typically maintenance might be done once or
twice a week.  Under heavy crawling (lots and lots of hosts being
crawled), you might do maintenance once every 2 days or so.
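
As a rough way to turn a maintenance strategy into a disk figure: since
VACUUM FULL writes a new compact image of each table, the old and new
copies of a table coexist for a while.  The sketch below assumes tables
are processed one at a time and uses a simple upper bound (all tables at
their bloated size, plus the largest compact copy).  The function name
and the idea of measuring sizes just before and just after maintenance
are assumptions for illustration; the numbers must come from your own
measurements.

```python
def peak_disk_estimate(table_sizes):
    """Rough upper bound on disk needed during maintenance.

    table_sizes: dict mapping table name -> (bloated_bytes, compact_bytes),
    where bloated_bytes is the size measured just before maintenance and
    compact_bytes is the size measured just after it.

    Assumption: tables are rewritten one at a time, so at worst every
    table is still bloated while the largest compact copy is being built.
    """
    if not table_sizes:
        return 0
    bloated_total = sum(bloated for bloated, _ in table_sizes.values())
    largest_copy = max(compact for _, compact in table_sizes.values())
    return bloated_total + largest_copy


if __name__ == "__main__":
    # Hypothetical sizes in bytes, not measurements from a real crawl.
    sizes = {"a": (100, 40), "b": (300, 120)}
    print(peak_disk_estimate(sizes))
```

Per-table sizes can be read with PostgreSQL's pg_total_relation_size()
function; taking the readings over a few maintenance cycles gives a
steadier estimate than a single one.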


On Mon, Jul 25, 2011 at 9:06 AM, Erlend Garåsen <> wrote:
> Hello list,
> In order to crawl around 100,000 documents, how much disk usage/table space
> will be needed for PostgreSQL? Our database administrators are now asking.
> Instead of starting up this crawl (which will take a lot of time) and trying to
> measure this manually, I hope we could get an answer from the list members
> instead.
> And will the table space increase significantly for every recrawl?
> Erlend
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
