manifoldcf-user mailing list archives

From Karl Wright <>
Subject Re: creating a manifoldCF cluster
Date Thu, 03 Sep 2015 20:37:10 GMT

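First, in order to run multiple agents processes, each one needs its own
unique ID.  See the ZooKeeper multiprocess example: it has two agents
startup scripts, and each one starts a separate agents instance with a
different ID.  The "Service 'A' of type 'AGENT' is already active" error
indicates that both of your agents processes are registering with the same
ID.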
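
As a rough sketch of what that could look like in your setup: assuming the
ID is taken from the org.apache.manifoldcf.processid property (my reading of
the ManifoldCF deployment documentation; please check it against your
version), and assuming ServerTwo has its own properties.xml, the second
machine would carry a different value than the 'A' shown in your error
message.  The value "B" is only an illustration.

```xml
<!-- properties.xml on ServerTwo (hypothetical fragment, not a full file) -->
<!-- Assumption: the agents process ID is read from this property. -->
<!-- It must differ from the ID used on ServerOne ('A' in the error). -->
<property name="org.apache.manifoldcf.processid" value="B"/>
```

If both machines share a single properties.xml, the ID would have to be
supplied per process some other way, for example as a JVM system property in
each startup script, which is what I believe the example scripts do.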

Second, the throughput you are seeing may not be limited by ManifoldCF.  In
fact, it is probably limited by Documentum.  You can easily confirm that by
running a similar crawl against something like the local file system and
comparing the throughput you get there.  If Documentum is the bottleneck,
adding more hardware on the MCF side will not improve it.


On Thu, Sep 3, 2015 at 2:52 PM, Mike Caceres <> wrote:

> I need to crawl a few million documents from Documentum and send them to
> Elasticsearch. After the initial full crawl is done, I am planning to have
> daily incremental crawls in order to keep the indexes up to date.
> Things are working now, but the crawler is advancing at a rate that needs
> to be improved, so I am thinking of setting up multiple machines to crawl
> the documents simultaneously.
> The repository connection has a max of 50 connections. I am using
> PostgreSQL as database.  Other relevant ManifoldCF properties are set to
> these values:
> <property name="org.apache.manifoldcf.database.maxhandles" value="200"/>
> <property name="org.apache.manifoldcf.db.postgres.analyze.jobqueue"
> value="5000000"/>
> <property name="org.apache.manifoldcf.db.postgres.reindex.jobqueue"
> value="5000000"/>
> <property name="org.apache.manifoldcf.crawler.threads" value="50"/>
> With this configuration I can see 1 document per second being processed
> end to end - I mean: many documents are being fetched and sent to index in
> parallel, but the overall throughput I can see via the manifoldcf-ui is
> about 60 documents in 60 seconds.
> Changing these configuration values does not affect the throughput much;
> at most I can get 90 documents in 60 seconds. I think I followed all the
> recommendations mentioned in
> <>
>  and
> <>
> .
> So now I am planning to see if I can run ManifoldCF in a
> multi-process/multi-server fashion, in order to improve that throughput.
> The hope is that if the initial crawl is made by let's say 10 machines,
> the ingestion may finish in a few days instead of a month or so.
> I started by just having two servers (let's call them ServerOne and
> ServerTwo) pointing to the same PostgreSQL database and the same ZooKeeper
> instances (both running on ServerOne).
> I can successfully start all ManifoldCF-related processes on ServerOne,
> and I started running a crawling job on this server. Now I'd like to add
> ServerTwo to the picture.
> The problem I am observing is that the Agents processes in ServerTwo are
> dying as soon as I start them, with this error:
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Service 'A' of
> type 'AGENT' is already active
>         at
> org.apache.manifoldcf.core.lockmanager.ZooKeeperLockManager.registerServiceBeginServiceActivity(
>         at
> org.apache.manifoldcf.core.lockmanager.ZooKeeperLockManager.registerServiceBeginServiceActivity(
>         at
> org.apache.manifoldcf.agents.AgentRun.doExecute(
>         at
> org.apache.manifoldcf.agents.BaseAgentsInitializationCommand.execute(
>         at org.apache.manifoldcf.agents.AgentRun.main(
> [Shutdown thread] INFO org.apache.zookeeper.ZooKeeper - Session:
> 0x14f7372545232d9 closed
> [main-EventThread] INFO org.apache.zookeeper.ClientCnxn - EventThread shut
> down
> So my questions are:
> a) Should I configure something else or differently on either ServerOne or
> ServerTwo in order to achieve the scenario of multiple machines crawling
> Documentum simultaneously?
> b) Is this a common practice?
> Thank you!
> Mike
