manifoldcf-dev mailing list archives

From Alessandro Benedetti <benedetti.ale...@gmail.com>
Subject Re: Scaling in MCF
Date Mon, 10 Nov 2014 17:00:53 GMT
Thanks, Karl. We can talk about this again when I pick that task back up,
but for now thanks for the detailed explanation.

Cheers

2014-11-10 14:44 GMT+00:00 Karl Wright <daddywri@gmail.com>:

> Hi Alessandro,
>
> bq. is there any parallelism in the indexing process?
>
> Yes, it is highly parallel.  Many man-years of effort have gone into making
> sure there are no bottlenecks in document processing and indexing.
>
> bq. Is each row indexed sequentially from a datasource? Do jobs run in
> parallel across different data sources?
>
> You should read the architecture chapters of MCF in Action.  Short answer:
> there is no ordering, and worker threads handle documents from multiple
> jobs.
>
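To illustrate the "no ordering, shared worker threads" point, here is a minimal sketch, not MCF's actual implementation. It assumes a single in-memory queue and hypothetical job names and document ids; MCF itself keeps its queue in a database table. Several worker threads drain one queue that holds documents from more than one job, so no job has its own dedicated threads and completion order is not guaranteed.

```python
import queue
import threading


def crawl(jobs, worker_count=4):
    """Process (job, document) pairs from one shared queue.

    `jobs` maps a hypothetical job name to a list of document ids.
    Returns the (job, document) pairs in the order they completed.
    """
    work = queue.Queue()
    for job, docs in jobs.items():
        for doc in docs:
            work.put((job, doc))

    done = []
    done_lock = threading.Lock()

    def worker():
        while True:
            try:
                item = work.get_nowait()
            except queue.Empty:
                return  # queue drained, so this worker exits
            # Stand-in for fetching and indexing the document.
            with done_lock:
                done.append(item)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


if __name__ == "__main__":
    # Hypothetical jobs; any worker may pick up a document from either one.
    finished = crawl({"shares": ["a", "b", "c"], "alfresco": ["x", "y"]})
    print(finished)
```

Every document is processed exactly once, but the order in which pairs appear in the result depends on thread scheduling.
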
> bq. Manifold is really slow at indexing GBs of data (simply crawling from
> Windows shares or Alfresco).
>
> For Windows shares, the bottleneck is very likely to be Windows itself, and
> you can't improve that by increasing parallelism, because Windows servers
> will fall over and die if you try.  We recommend, in fact, throttling JCIFS
> connections heavily to prevent that from occurring.
>
> For Alfresco, I have noted from others that Alfresco is often also a
> bottleneck.  I believe that people tend to severely under-resource their
> Alfresco instances.  You may get better results if you give more memory to
> your instance.
>
> In both cases I highly recommend getting a couple of thread dumps during
> crawling.  This is crude but very helpful in determining where the
> bottleneck in fact lies.  If it is the repository, as I suspect in your
> case, then you cannot improve things by tweaking MCF in any way.
>
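One rough way to read a couple of thread dumps is to count the innermost stack frame of each thread: if most threads sit in the same socket read, the repository is the likely bottleneck. The sketch below assumes a jstack-style text dump; the sample dump, its thread names and the `example.repo.Client` frames are made up for illustration and are not taken from a real MCF crawl.

```python
import re
from collections import Counter

# Illustrative jstack-style excerpt; thread names and frames are invented.
SAMPLE_DUMP = '''\
"worker-1" #21 prio=5 os_prio=0 tid=0x01 nid=0x1 runnable [0x0]
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at example.repo.Client.read(Client.java:10)

"worker-2" #22 prio=5 os_prio=0 tid=0x02 nid=0x2 runnable [0x0]
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at example.repo.Client.read(Client.java:10)

"worker-3" #23 prio=5 os_prio=0 tid=0x03 nid=0x3 waiting on condition [0x0]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
'''

HEADER = re.compile(r'^"[^"]+"')              # start of a thread block
FRAME = re.compile(r'^\s+at ([^\s(]+)\(')     # one "at pkg.Class.method(" line


def top_frames(dump_text):
    """Count the innermost stack frame of every thread in the dump."""
    counts = Counter()
    awaiting_first_frame = False
    for line in dump_text.splitlines():
        if HEADER.match(line):
            awaiting_first_frame = True
        elif awaiting_first_frame:
            match = FRAME.match(line)
            if match:
                counts[match.group(1)] += 1
                awaiting_first_frame = False
    return counts


if __name__ == "__main__":
    for frame, count in top_frames(SAMPLE_DUMP).most_common():
        print(count, frame)
```

Running the same summary over dumps taken a few seconds apart shows whether the same frames keep dominating.
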
> bq. Thinking something like SolrCloud is for Solr, for Manifold...
>
> You can spin up multiple agents processes in fact, and have been able to do
> this since MCF 1.5.  However, I doubt this will help you, given your
> description of the problem so far.
>
> Thanks,
> Karl
>
>
> On Mon, Nov 10, 2014 at 9:30 AM, Alessandro Benedetti <
> benedetti.alex85@gmail.com> wrote:
>
> > Hi Karl,
> > just thinking about how it works right now...
> > Is there any parallelism in the indexing process?
> > Is each row indexed sequentially from a datasource?
> > Do jobs run in parallel across different data sources? (I think yes.)
> >
> > I ask because, at least in my use cases,
> > Manifold is really slow at indexing GBs of data (simply crawling from
> > Windows shares or Alfresco).
> > Maybe this performance could be improved with a proper cluster
> > organization of different Manifold instances.
> > I'm thinking of something like what SolrCloud is for Solr, but for Manifold...
> > Is there any thought about an architecture of different Manifold
> > instances working together?
> >
> > Cheers
> >
> > 2014-11-10 13:22 GMT+00:00 Karl Wright <daddywri@gmail.com>:
> >
> > > Hi Aeham,
> > >
> > > I have a design which should improve reprioritization dramatically.  It's
> > > described in CONNECTORS-1100, and I'm actively working on it now.  This is,
> > > however, pretty complicated in that document scheduling and prioritization
> > > is anything but simple in ManifoldCF.  I'm hoping that when I'm satisfied
> > > with the work, you will have the ability to try it out in a larger
> > > setting.  But I'm not expecting to be ready for some weeks.
> > >
> > > When this work is done, the minimum time required for a job start etc. will
> > > be the time needed to clear all existing document priority values to a
> > > nullDocumentPriority value.  If you have 100 million jobqueue rows, and
> > > most of those are active, it will still undoubtedly take PostgreSQL some
> > > time to update all of them.  Could you come up with an estimate for how
> > > long that would in fact take?  You mentioned one hour; how many documents
> > > was that for?
> > >
> > > Karl
> > >
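A back-of-the-envelope way to produce the estimate asked for above is to time the update on a smaller sample and scale it linearly to the 100 million jobqueue rows mentioned. This is only a sketch: the sample size and timing in the usage line are placeholders rather than measurements, and real PostgreSQL update times may not scale linearly once indexes and I/O come into play.

```python
def estimate_update_seconds(sample_rows, sample_seconds, total_rows):
    """Linearly extrapolate the time needed to update total_rows rows.

    sample_rows / sample_seconds describe a timed update on a subset of
    the table; both are values you would have to measure yourself.
    """
    if sample_rows <= 0 or sample_seconds < 0:
        raise ValueError("sample_rows must be positive and sample_seconds non-negative")
    return sample_seconds * total_rows / sample_rows


if __name__ == "__main__":
    # Placeholder numbers: 1M rows updated in 36 s, scaled to 100M rows.
    seconds = estimate_update_seconds(1_000_000, 36.0, 100_000_000)
    print(f"estimated: {seconds / 60:.0f} minutes")
```
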
> > >
> > >
> > > On Fri, Nov 7, 2014 at 12:12 PM, Aeham Abushwashi <
> > > aeham.abushwashi@exonar.com> wrote:
> > >
> > > > Hi Karl,
> > > >
> > > > It's great to see performance and scalability emphasised as top-priority
> > > > items for Manifold! This has been clearly demonstrated through the
> > > > attention and quick turnaround that a number of recent performance issues
> > > > have received. This is very much appreciated!
> > > >
> > > > My team is happy to help in any way we can to make Manifold scale and
> > > > perform better. We'll continue to report the results of our testing and
> > > > analyses, and would certainly be willing to contribute best practices,
> > > > fixes and enhancements where possible.
> > > >
> > > > Cheers,
> > > > Aeham
> > > >
> > > > On 6 November 2014 00:39, Karl Wright <daddywri@gmail.com> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > We've lately had several users applying ManifoldCF to what I'd call "large"
> > > > > crawls (10M - 100M documents).  This is great news, and I hope their
> > > > > experiences turn out well.  I also hope that, once successful, these users
> > > > > help us document best practices for crawls of this size.
> > > > >
> > > > > It's also a good opportunity to revisit the sizing constraints for MCF as
> > > > > they exist today.  There are really two areas of interest when we consider
> > > > > the large database instances needed to track this number of documents.  The
> > > > > first consideration is how quickly we can identify records that need to be
> > > > > processed -- and ensure that they are processed in an order that makes
> > > > > sense given throttling constraints on the queue.  The second consideration
> > > > > is what kind of system overhead is needed to meet the first constraint, and
> > > > > whether this becomes unwieldy at some point.
> > > > >
> > > > > I've been pleasantly surprised at how well the current MCF architecture
> > > > > supports document queuing even when database tables get very large.  We
> > > > > recently encountered some bugs here, but those were easily fixed, and I
> > > > > really see little getting in the way from this angle of MCF scaling even to
> > > > > a billion documents now.  However, the overhead needed to manage that
> > > > > scheduling relies on keeping one specific index in the proper document
> > > > > order.  Under conditions where jobs are stopped or started, the index often
> > > > > will need to be reordered.  When there are lots of documents that need to
> > > > > be reprioritized, this can be a very time-consuming operation.  In my
> > > > > opinion this is now the limiting factor for MCF scaling.  When it starts
> > > > > taking an hour or more to start a job, or stop it, or restart the agents
> > > > > process, working with MCF becomes clearly less than ideal.  So I think this
> > > > > deserves some thought and work.
> > > > >
> > > > > Over the next couple of weeks, I'm hoping to spend some time thinking
> > > > > through alternatives to the current index structure, which might permit
> > > > > faster starts and stops.  There's no guarantee of a full solution, but my
> > > > > hope would be that with some compound index magic there might be
> > > > > significant improvements here, at no cost to the performance of queuing.
> > > > >
> > > > > Thanks,
> > > > > Karl
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > --------------------------
> >
> > Benedetti Alessandro
> > Visiting card : http://about.me/alessandro_benedetti
> >
> > "Tyger, tyger burning bright
> > In the forests of the night,
> > What immortal hand or eye
> > Could frame thy fearful symmetry?"
> >
> > William Blake - Songs of Experience -1794 England
> >
>



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
