manifoldcf-user mailing list archives

From Erlend Garåsen <e.f.gara...@usit.uio.no>
Subject Re: Disk usage for big crawl
Date Tue, 26 Jul 2011 13:52:03 GMT

Thanks, I will send your recommendations to the database administrators. 
They are responsible for setting up maintenance strategies for the 
PostgreSQL databases running at the university.

BTW, I started to crawl all the web pages yesterday, and the job will 
probably finish later today. Then I can ask the database administrators 
to check the size of the tables (after they have performed a vacuum, if 
necessary). I'm not sure whether it's worth doing a new crawl after 
they have checked the disk usage, in order to find out whether a recrawl 
increases the table space significantly. I don't think it will, but we 
need to inform them in case the size grows noticeably after a 
month or two.
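
A minimal sketch of how the before/after comparison could be done, assuming
the administrators record one size snapshot after the first crawl and another
after a recrawl. The query uses PostgreSQL's pg_stat_user_tables view and
pg_total_relation_size(); the helper growth_report and the table names and
numbers in the usage example are hypothetical, made up for illustration.

```python
# Query the DBAs could run to take a snapshot (needs a database connection,
# which this sketch deliberately leaves out).
SIZE_QUERY = """
SELECT relname,
       pg_total_relation_size(relid) AS total_bytes,
       n_live_tup,
       n_dead_tup
FROM pg_stat_user_tables
ORDER BY total_bytes DESC;
"""


def growth_report(before, after):
    """Compare two snapshots given as {table_name: total_bytes}.

    Returns {table_name: (bytes_added, percent_growth)} for every table in
    `after`. percent_growth is None when the table has no earlier size.
    """
    report = {}
    for table, new_size in after.items():
        old_size = before.get(table, 0)
        added = new_size - old_size
        percent = (added / old_size * 100.0) if old_size else None
        report[table] = (added, percent)
    return report


if __name__ == "__main__":
    # Made-up sizes and table names, for illustration only.
    first_crawl = {"table_a": 400_000_000, "table_b": 150_000_000}
    after_recrawl = {"table_a": 460_000_000, "table_b": 150_000_000}
    for table, (added, pct) in growth_report(first_crawl, after_recrawl).items():
        print(table, added, pct)
```

Taking both snapshots right after a vacuum should make them more comparable,
since dead tuples would otherwise inflate one of the two measurements.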

Erlend

On 25.07.11 15.16, Karl Wright wrote:
> Hi Erlend,
>
> I can't answer for how PostgreSQL allocates space on the whole - the
> PostgreSQL documentation may tell you more.  I can say this much:
>
> (1) PostgreSQL keeps "dead tuples" around until they are "vacuumed".
> This implies that the table space grows until the vacuuming operation
> takes place.
> (2) At MetaCarta, we found that PostgreSQL's normal autovacuuming
> process (which runs in background) was insufficient to keep up with
> ManifoldCF going at full tilt in a web crawl.
> (3) The solution at MetaCarta was to periodically run "maintenance",
> which involves running a VACUUM FULL operation on the database.  This
> will cause the crawl to stall while the vacuum operation is going,
> since a new (compact) disk image of every table must be made, and thus
> each table is locked for a period of time.
>
> So my suggestion is to adopt a maintenance strategy first, make sure
> it is working for you, and then calculate how much disk space you will
> need for that strategy.  Typically maintenance might be done once or
> twice a week.  Under heavy crawling (lots and lots of hosts being
> crawled), you might do maintenance once every 2 days or so.
>
> Karl
>
>
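
Since the maintenance described above makes a new compact copy of each table,
the disk needs headroom for the bloated table plus its copy. A rough sketch of
that arithmetic follows. It assumes tables are rewritten one at a time and the
old copy is released after each rewrite; the function name and the input
dictionaries are hypothetical, and real PostgreSQL behaviour (indexes, WAL,
version differences) may add to this figure.

```python
def peak_disk_upper_bound(bloated_bytes, compact_bytes):
    """Rough upper bound on disk use while tables are rewritten one by one.

    bloated_bytes: {table: size on disk before maintenance, dead tuples included}
    compact_bytes: {table: estimated size of the compact copy}

    Assumption: at any moment at most one compact copy exists next to the
    not-yet-released originals, so the bound is the sum of all current sizes
    plus the largest single copy. A table without an estimate is assumed to
    need a copy as large as its current size.
    """
    if not bloated_bytes:
        return 0
    largest_copy = max(
        compact_bytes.get(table, size) for table, size in bloated_bytes.items()
    )
    return sum(bloated_bytes.values()) + largest_copy
```

The bloated sizes would come from a measurement taken just before a
maintenance run, so the result depends directly on how often maintenance is
done: a longer interval means more dead tuples and a higher bound.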
> On Mon, Jul 25, 2011 at 9:06 AM, Erlend Garåsen <e.f.garasen@usit.uio.no> wrote:
>>
>> Hello list,
>>
>> In order to crawl around 100,000 documents, how much disk usage/table space
>> will be needed for PostgreSQL? Our database administrators are now asking.
>> Instead of starting up this crawl (which will take a lot of time) and
>> trying to measure this manually, I hope we can get an answer from the
>> list members.
>>
>> And will the table space increase significantly for every recrawl?
>>
>> Erlend
>>
>> --
>> Erlend Garåsen
>> Center for Information Technology Services
>> University of Oslo
>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>>


-- 
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
