manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Apache Manifoldcf High Availability requirements
Date Wed, 16 Apr 2014 13:59:39 GMT
Hi Lalit,

If I were you, I'd do a sample crawl of a characteristic subset of your
documents, and then assess the space required by the database for that.
There's no way I can assess this in advance, because each connector has
different space requirements in the database, and it depends to some degree
on your documents as well -- specifically, the document metadata.

You should also read up on PostgreSQL maintenance procedures, because
vacuuming frequency will determine how much extra disk space you will
require due to dead tuples.
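
As a rough illustration of the sample-crawl approach, the arithmetic can be sketched as below. This is only a sketch under stated assumptions: it assumes database size grows roughly linearly with document count, and the function names, the example figures, and the `bloat_fraction` headroom parameter for dead tuples are all hypothetical placeholders, not measured values or part of ManifoldCF.

```python
def estimate_db_bytes(sample_docs, sample_db_bytes, total_docs, bloat_fraction=0.0):
    """Linearly extrapolate database size from a sample crawl.

    sample_docs     -- number of documents in the sample crawl
    sample_db_bytes -- database size measured after the sample crawl
    total_docs      -- number of documents to plan for
    bloat_fraction  -- assumed extra space for dead tuples between vacuums
                       (hypothetical knob; measure it on your own system)
    """
    if sample_docs <= 0:
        raise ValueError("sample_docs must be positive")
    if bloat_fraction < 0:
        raise ValueError("bloat_fraction must not be negative")
    bytes_per_doc = sample_db_bytes / sample_docs
    return bytes_per_doc * total_docs * (1.0 + bloat_fraction)


def yearly_projection(sample_docs, sample_db_bytes, initial_docs,
                      docs_added_per_year, years, bloat_fraction=0.0):
    """Return estimated sizes for year 0 (initial load) through `years`."""
    return [
        estimate_db_bytes(
            sample_docs,
            sample_db_bytes,
            initial_docs + year * docs_added_per_year,
            bloat_fraction,
        )
        for year in range(years + 1)
    ]


if __name__ == "__main__":
    # Placeholder figures only: replace them with numbers from a real sample crawl.
    sizes = yearly_projection(
        sample_docs=100_000,
        sample_db_bytes=200_000_000,
        initial_docs=10_000_000,
        docs_added_per_year=10_000_000,
        years=3,
        bloat_fraction=0.5,
    )
    for year, size in enumerate(sizes):
        print(f"year {year}: about {size / 1e9:.1f} GB")
```

The per-document cost will differ by connector and by document metadata, so it is probably safer to run one sample per repository type and add up the separate estimates.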

Thanks,
Karl





On Wed, Apr 16, 2014 at 9:53 AM, lalit jangra <lalit.j.jangra@gmail.com> wrote:

> Thanks Karl,
>
> I also want to know how to size disks for such a setup. I assume most of
> the disk space will be taken by the DB, which is PostgreSQL here, so what
> size should I start with, and what should the expansion policy be, keeping
> in mind that I have a minimum of 10 million documents at the start and
> similar volumes will be added each year?
>
>
> On Wed, Apr 16, 2014 at 12:51 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Lalit,
>>
>> ManifoldCF when operating in a clustered scenario will not work with
>> separate DB instances, even if they are synched.  You can only operate it
>> under conditions where transactional integrity is maintained, which would
>> be a single common clustered DB instance.
>>
>> I'll let others talk to your other points.
>>
>> (Graeme, are you following this?)
>>
>> Karl
>>
>>
>>
>> On Wed, Apr 16, 2014 at 7:40 AM, lalit jangra <lalit.j.jangra@gmail.com> wrote:
>>
>>> Hi,
>>>
>>>
>>>
>>> I am using MCF to crawl multiple sources holding around 10-15 million
>>> documents initially, with similar volumes added each year, and I want it
>>> to be clustered in high availability mode. For this, I have some
>>> questions in mind.
>>>
>>> 1. I am using a PostgreSQL DB with Tomcat 7 hosting MCF.
>>>
>>> 2. How much DB size should be considered for such scenarios, as we
>>> have documents in the magnitude of TBs?
>>>
>>> 3. Does PostgreSQL run on VMs?
>>>
>>> 4. What would be the ideal clustering approach: two different MCF
>>> servers managed by ZooKeeper, each having its own DB, with the DBs
>>> kept in sync with each other, managed by a set of two load balancers;
>>> or two different MCF instances having a common clustered
>>> (active/passive) DB instance, managed by a set of two load balancers?
>>>
>>> 5. If I use the first approach (two different MCF servers managed by
>>> ZooKeeper, each having its own DB, with the DBs kept in sync with each
>>> other, managed by a set of two load balancers), I need to sync both DB
>>> instances, which adds extra tasks.
>>>
>>> 6. If I use the second approach (two different MCF instances having a
>>> common clustered active/passive DB instance, managed by a set of two
>>> load balancers), I have a set of clustered DBs.
>>>
>>> 7. Which of these approaches would yield better results?
>>>
>>> 8. Is there any definitive guide for high availability of MCF?
>>>
>>> Regards,
>>>
>>> Lalit.
>>>
>>>
>>>
>>
>>
>
>
> --
> Regards,
> Lalit Jangra.
>
