manifoldcf-dev mailing list archives

From Alessandro Benedetti <benedetti.ale...@gmail.com>
Subject Re: Scaling in MCF
Date Mon, 10 Nov 2014 14:30:10 GMT
Hi Karl,
just thinking about how it works right now...
Is there any parallelism in the indexing process?
Is each row from a data source indexed sequentially?
Do jobs run in parallel across different data sources? (I think yes.)

I ask because, at least in my use cases,
ManifoldCF is really slow at indexing gigabytes of data (simply crawling from
Windows shares or Alfresco).
Maybe this performance could be improved with a proper cluster organization of
different ManifoldCF instances.
I am thinking of something like what SolrCloud is for Solr, but for ManifoldCF...
Has there been any thought about an architecture of different ManifoldCF
instances working together?

Cheers

2014-11-10 13:22 GMT+00:00 Karl Wright <daddywri@gmail.com>:

> Hi Aeham,
>
> I have a design which should improve reprioritization dramatically.  It's
> described in CONNECTORS-1100, and I'm actively working on it now.  This is,
> however, pretty complicated in that document scheduling and prioritization
> is anything but simple in ManifoldCF.  I'm hoping that when I'm satisfied
> with the work, you will have the ability to try it out in a larger
> setting.  But I'm not expecting to be ready for some weeks.
>
> When this work is done, the minimum time required for a job start etc. will
> be the time needed to clear all existing document priority values to a
> nullDocumentPriority value.  If you have 100 million jobqueue rows, and
> most of those are active, it will still undoubtedly take PostgreSQL some
> time to update all of them.  Could you come up with an estimate for how
> long that would in fact take?  You mentioned one hour; how many documents
> was that for?
>
> Karl
>
>
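A rough way to produce the estimate asked for above is to time the update on a smaller sample and extrapolate. The following is only a minimal sketch, assuming the update time grows roughly linearly with the number of jobqueue rows; that is an assumption, since real PostgreSQL behaviour depends on indexes, hardware, and table bloat. The function name and the sample figures are hypothetical and are not measurements from this thread.

```python
def estimate_update_seconds(sample_rows, sample_seconds, target_rows):
    """Linearly extrapolate a bulk-update duration from one timed sample.

    Assumption: cost per row is constant, which is optimistic for very
    large tables where index maintenance and I/O may dominate.
    """
    if sample_rows <= 0 or sample_seconds < 0:
        raise ValueError("sample_rows must be positive, sample_seconds non-negative")
    return target_rows * sample_seconds / sample_rows


# Hypothetical sample: clearing priorities on 1 million rows took 30 seconds.
estimate = estimate_update_seconds(1_000_000, 30.0, 100_000_000)
print(f"Estimated time for 100 million rows: {estimate / 60:.1f} minutes")
```

Timing a few samples of different sizes would show whether the linear assumption holds before trusting the extrapolated figure.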
>
> On Fri, Nov 7, 2014 at 12:12 PM, Aeham Abushwashi <
> aeham.abushwashi@exonar.com> wrote:
>
> > Hi Karl,
> >
> > It's great to see performance and scalability emphasised as top priority
> > items for Manifold! This has been clearly demonstrated through the
> > attention and quick turnaround that a number of recent performance issues
> > have received. This is very much appreciated!
> >
> > My team is happy to help in any way we can to make Manifold scale and
> > perform better. We'll continue to report the results of our testing and
> > analyses, and would certainly be willing to contribute best-practices,
> > fixes and enhancements where possible.
> >
> > Cheers,
> > Aeham
> >
> > On 6 November 2014 00:39, Karl Wright <daddywri@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > We've lately had several users applying ManifoldCF to what I'd call
> > > "large" crawls (10M - 100M documents).  This is great news, and I hope
> > > their experiences turn out well.  I also hope that, once successful,
> > > these users help us document best practices for crawls of this size.
> > >
> > > It's also a good opportunity to revisit the sizing constraints for MCF
> > > as they exist today.  There are really two areas of interest when we
> > > consider the large database instances needed to track this number of
> > > documents.  The first consideration is how quickly we can identify
> > > records that need to be processed -- and ensure that they are processed
> > > in an order that makes sense given throttling constraints on the queue.
> > > The second consideration is what kind of system overhead is needed to
> > > meet the first constraint, and whether this becomes unwieldy at some
> > > point.
> > >
> > > I've been pleasantly surprised at how well the current MCF architecture
> > > supports document queuing even when database tables get very large.  We
> > > recently encountered some bugs here, but those were easily fixed, and I
> > > really see little getting in the way from this angle of MCF scaling even
> > > to a billion documents now.  However, the overhead needed to manage that
> > > scheduling relies on keeping one specific index in the proper document
> > > order.  Under conditions where jobs are stopped or started, the index
> > > often will need to be reordered.  When there are lots of documents that
> > > need to be reprioritized, this can be a very time-consuming operation.
> > > In my opinion this is now the limiting factor for MCF scaling.  When it
> > > starts taking an hour or more to start a job, or stop it, or restart the
> > > agents process, working with MCF becomes clearly less than ideal.  So I
> > > think this deserves some thought and work.
> > >
> > > Over the next couple of weeks, I'm hoping to spend some time thinking
> > > through alternatives to the current index structure, which might permit
> > > faster starts and stops.  There's no guarantee of a full solution, but
> > > my hope would be that with some compound index magic there might be
> > > significant improvements here, at no cost to the performance of queuing.
> > >
> > > Thanks,
> > > Karl
> > >
> >
>



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
