Date: Mon, 10 Nov 2014 09:44:11 -0500
Subject: Re: Scaling in MCF
From: Karl Wright <daddywri@gmail.com>
To: dev@manifoldcf.apache.org

Hi Alessandro,

bq. is there any parallelism in the indexing process?

Yes, it is highly parallel. Many man-years of effort have gone into making
sure there are no bottlenecks in document processing and indexing.

bq. Is each row indexed sequentially from a datasource? Are jobs occurring
in parallel across different data sources?

You should read the architecture chapters of MCF in Action. Short answer:
there is no ordering, and worker threads handle documents from multiple jobs.

bq. Manifold is really slow in indexing GB of data (simply crawling from
Windows shares or Alfresco).

For Windows shares, the bottleneck is very likely to be Windows itself, and
you can't improve that by increasing parallelism, because Windows servers
will fall over and die if you try. We recommend, in fact, throttling JCIFS
connections heavily to prevent that from occurring.

For Alfresco, I have heard from others that Alfresco is often also a
bottleneck. I believe that people tend to severely under-resource their
Alfresco instances. You may get better results if you give more memory to
your instance.

In both cases I highly recommend getting a couple of thread dumps during
crawling. This is crude but very helpful in determining where the bottleneck
in fact lies. If it is the repository, as I suspect in your case, then you
cannot improve things by tweaking MCF in any way.
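The thread-dump approach above can be made more systematic with a small helper that tallies thread states across jstack-style output; the sample dump text and the helper function below are illustrative sketches, not part of ManifoldCF or its tooling:

```python
import re
from collections import Counter

def summarize_thread_states(dump_text):
    """Tally thread states in a jstack-style thread dump.

    Matches lines like:
        java.lang.Thread.State: BLOCKED (on object monitor)
    A large, stable count of BLOCKED/WAITING worker threads across several
    dumps suggests the repository, not MCF, is the bottleneck.
    """
    states = re.findall(r'java\.lang\.Thread\.State: (\w+)', dump_text)
    return Counter(states)

# Hypothetical excerpt of a thread dump taken during a crawl.
sample = '''\
"Worker thread 1" #12 prio=5
   java.lang.Thread.State: RUNNABLE
"Worker thread 2" #13 prio=5
   java.lang.Thread.State: BLOCKED (on object monitor)
"Worker thread 3" #14 prio=5
   java.lang.Thread.State: BLOCKED (on object monitor)
'''

print(summarize_thread_states(sample))
# -> Counter({'BLOCKED': 2, 'RUNNABLE': 1})
```

Comparing these tallies across two or three dumps, as Karl suggests, shows whether the same threads stay blocked on repository I/O the whole time.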
bq. Thinking something like SolrCloud is for Solr, for Manifold...

You can spin up multiple agents processes, in fact, and have been able to do
so since MCF 1.5. However, I doubt this will help you, given your
description of the problem so far.

Thanks,
Karl

On Mon, Nov 10, 2014 at 9:30 AM, Alessandro Benedetti <
benedetti.alex85@gmail.com> wrote:

> Hi Karl,
> just thinking about how it works right now...
> Is there any parallelism in the indexing process?
> Is each row indexed sequentially from a datasource?
> Are jobs occurring in parallel across different data sources? (I think yes)
>
> Because I was thinking, at least in my use cases,
> Manifold is really slow in indexing GB of data (simply crawling from
> Windows shares or Alfresco).
> Maybe this performance can be helped with a proper cluster organization of
> different Manifold instances.
> Thinking something like SolrCloud is for Solr, for Manifold...
> Is there any thought about an architecture of different Manifold instances
> working together?
>
> Cheers
>
> 2014-11-10 13:22 GMT+00:00 Karl Wright :
>
> > Hi Aeham,
> >
> > I have a design which should improve reprioritization dramatically. It's
> > described in CONNECTORS-1100, and I'm actively working on it now. This
> > is, however, pretty complicated, in that document scheduling and
> > prioritization are anything but simple in ManifoldCF. I'm hoping that
> > when I'm satisfied with the work, you will have the ability to try it
> > out in a larger setting. But I'm not expecting it to be ready for some
> > weeks.
> >
> > When this work is done, the minimum time required for a job start etc.
> > will be the time needed to clear all existing document priority values
> > to a nullDocumentPriority value. If you have 100 million jobqueue rows,
> > and most of those are active, it will still undoubtedly take PostgreSQL
> > some time to update all of them. Could you come up with an estimate for
> > how long that would in fact take?
> > You mentioned one hour; how many documents was that for?
> >
> > Karl
> >
> > On Fri, Nov 7, 2014 at 12:12 PM, Aeham Abushwashi <
> > aeham.abushwashi@exonar.com> wrote:
> >
> > > Hi Karl,
> > >
> > > It's great to see performance and scalability emphasised as top
> > > priority items for Manifold! This has been clearly demonstrated
> > > through the attention and quick turnaround that a number of recent
> > > performance issues have received. This is very much appreciated!
> > >
> > > My team is happy to help in any way we can to make Manifold scale and
> > > perform better. We'll continue to report the results of our testing
> > > and analyses, and would certainly be willing to contribute
> > > best practices, fixes and enhancements where possible.
> > >
> > > Cheers,
> > > Aeham
> > >
> > > On 6 November 2014 00:39, Karl Wright wrote:
> > >
> > > > Hi all,
> > > >
> > > > We've lately had several users applying ManifoldCF to what I'd call
> > > > "large" crawls (10M - 100M documents). This is great news, and I
> > > > hope their experiences turn out well. I also hope that, once
> > > > successful, these users help us document best practices for crawls
> > > > of this size.
> > > >
> > > > It's also a good opportunity to revisit the sizing constraints for
> > > > MCF as they exist today. There are really two areas of interest when
> > > > we consider the large database instances needed to track this number
> > > > of documents. The first consideration is how quickly we can identify
> > > > records that need to be processed -- and ensure that they are
> > > > processed in an order that makes sense given throttling constraints
> > > > on the queue. The second consideration is what kind of system
> > > > overhead is needed to meet the first constraint, and whether this
> > > > becomes unwieldy at some point.
> > > >
> > > > I've been pleasantly surprised at how well the current MCF
> > > > architecture supports document queuing even when database tables get
> > > > very large. We recently encountered some bugs here, but those were
> > > > easily fixed, and I really see little getting in the way of MCF
> > > > scaling even to a billion documents now. However, the overhead
> > > > needed to manage that scheduling relies on keeping one specific
> > > > index in the proper document order. Under conditions where jobs are
> > > > stopped or started, the index often will need to be reordered. When
> > > > there are lots of documents that need to be reprioritized, this can
> > > > be a very time-consuming operation. In my opinion this is now the
> > > > limiting factor for MCF scaling. When it starts taking an hour or
> > > > more to start a job, or stop it, or restart the agents process,
> > > > working with MCF becomes clearly less than ideal. So I think this
> > > > deserves some thought and work.
> > > >
> > > > Over the next couple of weeks, I'm hoping to spend some time
> > > > thinking through alternatives to the current index structure, which
> > > > might permit faster starts and stops. There's no guarantee of a full
> > > > solution, but my hope would be that with some compound-index magic
> > > > there might be significant improvements here, at no cost to the
> > > > performance of queuing.
> > > >
> > > > Thanks,
> > > > Karl
> > > >
> > >
> >
>
> --
> --------------------------
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
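The reprioritization cost discussed in the thread boils down to one bulk UPDATE over MCF's jobqueue table. A toy sketch of that pattern follows, with sqlite3 standing in for PostgreSQL; the table name, `docpriority` column, and sentinel "null priority" value are modeled on the discussion and should be treated as assumptions, not MCF's actual schema or constants:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Toy stand-in for MCF's jobqueue table (the real table has many more
# columns and, crucially, a composite index involving docpriority).
cur.execute("CREATE TABLE jobqueue (id INTEGER PRIMARY KEY, docpriority REAL)")
cur.executemany(
    "INSERT INTO jobqueue (id, docpriority) VALUES (?, ?)",
    [(i, i * 0.001) for i in range(100_000)],
)
conn.commit()

# The proposed design clears every live priority to a sentinel "null
# document priority" value at job start/stop. On 100 million rows this
# single statement is what takes PostgreSQL so long, since every row
# (and its index entry) must be rewritten.
NULL_DOC_PRIORITY = 1e9  # illustrative sentinel, not MCF's actual constant
cur.execute("UPDATE jobqueue SET docpriority = ?", (NULL_DOC_PRIORITY,))
conn.commit()

cur.execute("SELECT COUNT(*) FROM jobqueue WHERE docpriority = ?",
            (NULL_DOC_PRIORITY,))
print(cur.fetchone()[0])
# -> 100000
```

Timing this update at realistic row counts on the actual PostgreSQL instance would give the estimate Karl asks Aeham for.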