manifoldcf-user mailing list archives

From Karl Wright
Subject Re: Parallelize jobs
Date Mon, 26 Feb 2018 16:14:27 GMT
Hi Julien,

There's actually quite a bit of logic in MCF to run jobs concurrently.  The
problem, though, is that documents are "scheduled" in advance, and that
scheduling is not readily updateable on the fly.  So if you have a job
running that has already queued 100,000 documents and then you start
another job, that job's documents won't get processed until the first job's
100,000 documents are processed.  After that the jobs will run concurrently.

This happens because MCF uses a database to manage its queue.  The query
that locates documents for processing needs to order them by something so
that documents are handled fairly.  That ordering is stored in the
"docpriority" field, if you are interested.
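
To illustrate the effect described above, here is a minimal sketch in Python. It is a toy model, not MCF's actual code: it assumes each document receives its priority once, at queue time, from a simple increasing counter, and that documents are always handed out in priority order. The function name `processing_order` and the counter are hypothetical; only the idea of a "docpriority" ordering comes from this message.

```python
import heapq
from itertools import count


def processing_order(queued):
    """Return job names in the order a priority-sorted queue hands them out.

    `queued` is a list of (job, doc_id) pairs in the order they were queued.
    Assumption: a document's priority is fixed when it is queued and is
    never recomputed afterwards.
    """
    next_priority = count()  # hypothetical stand-in for a "docpriority" value
    heap = []
    for job, doc_id in queued:
        heapq.heappush(heap, (next(next_priority), job, doc_id))

    order = []
    while heap:
        _, job, _ = heapq.heappop(heap)  # lowest priority value goes first
        order.append(job)
    return order


# Job A has already queued five documents when job B starts and queues three.
queued = [("A", i) for i in range(5)] + [("B", i) for i in range(3)]
print(processing_order(queued))
```

In this model every document from job A is handed out before any document from job B, simply because A's documents were prioritized first.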

For connectors that identify all the documents they are going to crawl
during the seeding phase, this makes it look like jobs are completely
sequential.  For most connectors, however, that is not the case.
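
The difference between the two kinds of connector can be sketched with another toy model in Python. It is an illustration under simplifying assumptions, not how MCF is implemented: a FIFO queue stands in for priority order, each processed document discovers exactly one further document, and the name `crawl` and its parameters are hypothetical.

```python
from collections import deque


def crawl(jobs, seed_all_up_front):
    """Simulate the order in which documents from several jobs are processed.

    `jobs` maps a job name to its number of documents (at least 1 each).
    If `seed_all_up_front` is True, each job queues every document during
    seeding; otherwise it queues one seed, and each processed document
    queues the next one it discovers.
    """
    queue = deque()
    remaining = {}
    for job, total in jobs.items():
        if seed_all_up_front:
            queue.extend([job] * total)
            remaining[job] = 0
        else:
            queue.append(job)
            remaining[job] = total - 1

    order = []
    while queue:
        job = queue.popleft()
        order.append(job)
        if remaining[job] > 0:
            # Processing this document discovered one more for the same job.
            remaining[job] -= 1
            queue.append(job)
    return order


print(crawl({"A": 3, "B": 3}, seed_all_up_front=True))
print(crawl({"A": 3, "B": 3}, seed_all_up_front=False))
```

With everything seeded up front the order is A, A, A, B, B, B; with incremental discovery the two jobs alternate, which is the concurrent behavior most connectors show.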


On Mon, Feb 26, 2018 at 11:02 AM, Julien wrote:

> Hi MCF community,
> I was wondering if MCF is able to run several jobs concurrently and if
> there is a specific configuration to do that.
> I ask because I created two jobs, one using a file system input
> repository and one using a JCIFS input repository; the output is the same
> for both jobs (Solr). When I start them both, the execution is sequential:
> one job somehow waits until the other one is done.
> I tested this on MCF v2.7.
> Regards,
> Julien
