manifoldcf-dev mailing list archives

From Karl Wright <>
Subject Re: Scaling in MCF
Date Mon, 10 Nov 2014 13:22:31 GMT
Hi Aeham,

I have a design which should improve reprioritization dramatically.  It's
described in CONNECTORS-1100, and I'm actively working on it now.  It is,
however, pretty complicated, since document scheduling and prioritization
are anything but simple in ManifoldCF.  I'm hoping that when I'm satisfied
with the work, you will have the ability to try it out in a larger
setting.  But I'm not expecting to be ready for some weeks.

When this work is done, the minimum time required for a job start etc. will
be the time needed to clear all existing document priority values to a
nullDocumentPriority value.  If you have 100 million jobqueue rows, and
most of those are active, it will still undoubtedly take PostgreSQL some
time to update all of them.  Could you come up with an estimate for how
long that would in fact take?  You mentioned one hour; how many documents
was that for?
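To make the question concrete, here is a rough linear extrapolation one could use to answer it. This is purely illustrative: the function name and the 10M-rows-in-one-hour figure are assumptions for the example, not numbers from this thread, and it assumes the bulk UPDATE scales roughly linearly with row count.

```python
def estimate_reset_seconds(target_rows, reported_rows, reported_seconds):
    """Extrapolate the time to reset document priorities on target_rows
    jobqueue rows, given an observed run of reported_rows rows that took
    reported_seconds.  Assumes roughly linear scaling of the UPDATE."""
    rows_per_second = reported_rows / reported_seconds
    return target_rows / rows_per_second

# Hypothetical example: if 10 million rows took one hour (3600 s),
# 100 million rows would take about ten hours.
print(estimate_reset_seconds(100_000_000, 10_000_000, 3600))  # 36000.0
```

In practice the scaling may be worse than linear (index maintenance, dead-tuple bloat, and VACUUM pressure all grow with table size), so a measured number for the actual table is far more useful than this extrapolation.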


On Fri, Nov 7, 2014 at 12:12 PM, Aeham Abushwashi <> wrote:

> Hi Karl,
> It's great to see performance and scalability emphasised as top priority
> items for Manifold! This has been clearly demonstrated through the
> attention and quick turnaround that a number of recent performance issues
> have received. This is very much appreciated!
> My team is happy to help in any way we can to make Manifold scale and
> perform better. We'll continue to report the results of our testing and
> analyses, and would certainly be willing to contribute best-practices,
> fixes and enhancements where possible.
> Cheers,
> Aeham
> On 6 November 2014 00:39, Karl Wright <> wrote:
> > Hi all,
> >
> > We've lately had several users applying ManifoldCF to what I'd call
> > "large" crawls (10M - 100M documents).  This is great news, and I hope
> > their experiences turn out well.  I also hope that, once successful,
> > these users help us document best practices for crawls of this size.
> >
> > It's also a good opportunity to revisit the sizing constraints for MCF
> > as they exist today.  There are really two areas of interest when we
> > consider the large database instances needed to track this number of
> > documents.  The first consideration is how quickly we can identify
> > records that need to be processed -- and ensure that they are processed
> > in an order that makes sense given throttling constraints on the queue.
> > The second consideration is what kind of system overhead is needed to
> > meet the first constraint, and whether this becomes unwieldy at some
> > point.
> >
> > I've been pleasantly surprised at how well the current MCF architecture
> > supports document queuing even when database tables get very large.  We
> > recently encountered some bugs here, but those were easily fixed, and
> > from this angle I see little getting in the way of MCF scaling even to
> > a billion documents now.  However, the overhead needed to manage that
> > scheduling relies on keeping one specific index in the proper document
> > order.  Under conditions where jobs are stopped or started, the index
> > often will need to be reordered.  When there are lots of documents that
> > need to be reprioritized, this can be a very time-consuming operation.
> > In my opinion this is now the limiting factor for MCF scaling.  When it
> > starts taking an hour or more to start a job, or stop it, or restart
> > the agents process, working with MCF becomes clearly less than ideal.
> > So I think this deserves some thought and work.
> >
> > Over the next couple of weeks, I'm hoping to spend some time thinking
> > through alternatives to the current index structure, which might permit
> > faster starts and stops.  There's no guarantee of a full solution, but
> > my hope would be that with some compound index magic there might be
> > significant improvements here, at no cost to the performance of
> > queuing.
> >
> > Thanks,
> > Karl
> >
