manifoldcf-dev mailing list archives

From Karl Wright
Subject Scaling in MCF
Date Thu, 06 Nov 2014 00:39:23 GMT
Hi all,

We've lately had several users applying ManifoldCF to what I'd call "large"
crawls (10M - 100M documents).  This is great news, and I hope their
experiences turn out well.  I also hope that, once successful, these users
help us document best practices for crawls of this size.

It's also a good opportunity to revisit the sizing constraints for MCF as
they exist today.  There are really two areas of interest when we consider
the large database instances needed to track this number of documents.  The
first consideration is how quickly we can identify records that need to be
processed -- and ensure that they are processed in an order that makes
sense given throttling constraints on the queue.  The second consideration
is what kind of system overhead is needed to meet the first constraint, and
whether this becomes unwieldy at some point.

I've been pleasantly surprised at how well the current MCF architecture
supports document queuing even when database tables get very large.  We
recently encountered some bugs here, but those were easily fixed, and from
this angle I see little standing in the way of MCF scaling even to
a billion documents now.  However, the overhead needed to manage that
scheduling relies on keeping one specific index in the proper document
order.  Under conditions where jobs are stopped or started, the index often
will need to be reordered.  When there are lots of documents that need to
be reprioritized, this can be a very time-consuming operation.  In my
opinion this is now the limiting factor for MCF scaling.  When it starts
taking an hour or more to start a job, or stop it, or restart the agents
process, working with MCF becomes clearly less than ideal.  So I think this
deserves some thought and work.
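
To illustrate why reprioritization gets expensive, here is a minimal toy
sketch. It is not MCF code: the class PriorityIndex, the function stop_job,
and the "parked" priority value are all hypothetical names made up for this
example. It assumes a single index ordered by a per-document priority, and
simply counts how many index entries must be rewritten when one job stops.

```python
import bisect


class PriorityIndex:
    """Toy stand-in for a database index ordered by document priority."""

    def __init__(self):
        self._entries = []       # sorted list of (priority, doc_id)
        self._priority_of = {}   # doc_id -> current priority
        self.writes = 0          # number of index entries written so far

    def put(self, doc_id, priority):
        """Insert a document, or move it to a new position in the order."""
        old = self._priority_of.get(doc_id)
        if old is not None:
            # Remove the stale entry before inserting the new one.
            i = bisect.bisect_left(self._entries, (old, doc_id))
            del self._entries[i]
        bisect.insort(self._entries, (priority, doc_id))
        self._priority_of[doc_id] = priority
        self.writes += 1

    def next_doc(self):
        """Return the document at the front of the order, if any."""
        return self._entries[0][1] if self._entries else None


def stop_job(index, job_docs, parked_priority=float("inf")):
    """Hypothetical 'stop': push every document of the job to the back.

    Each document needs its own index write, so the cost grows linearly
    with the number of documents belonging to the job.
    """
    for doc_id in job_docs:
        index.put(doc_id, parked_priority)
```

In this toy model, stopping a job with N queued documents costs N index
writes, which is the kind of growth that turns into an hour-long stop or
start at 100M documents.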

Over the next couple of weeks, I'm hoping to spend some time thinking
through alternatives to the current index structure, which might permit
faster starts and stops.  There's no guarantee of a full solution, but my
hope would be that with some compound index magic there might be
significant improvements here, at no cost to the performance of queuing.
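
As a purely hypothetical illustration of the direction (not a design I am
proposing here, and not existing MCF code), one could imagine ordering
documents within each job and keeping the job's running state separately,
so that stopping a job is one small write rather than one write per
document. The names PerJobIndex, add, set_active, and next_doc below are
invented for this sketch, and throttling is ignored entirely.

```python
import heapq


class PerJobIndex:
    """Toy model of an ordering keyed by (job, priority).

    Job state lives in a small separate structure, so changing it does
    not touch any per-document entry.
    """

    def __init__(self):
        self._heaps = {}      # job_id -> heap of (priority, doc_id)
        self._active = set()  # job_ids that are currently running
        self.writes = 0       # number of writes of any kind

    def add(self, job_id, doc_id, priority):
        """Queue a document under its job, ordered by priority."""
        heapq.heappush(self._heaps.setdefault(job_id, []), (priority, doc_id))
        self.writes += 1

    def set_active(self, job_id, active):
        """Start or stop a job with a single write."""
        if active:
            self._active.add(job_id)
        else:
            self._active.discard(job_id)
        self.writes += 1

    def next_doc(self):
        """Pick the best head-of-queue document among active jobs."""
        heads = [self._heaps[j][0] for j in self._active if self._heaps.get(j)]
        return min(heads)[1] if heads else None
```

The tradeoff in this sketch is that finding the next document now has to
look at the head of every active job, so it only helps if the number of
jobs stays small compared with the number of documents.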

