manifoldcf-user mailing list archives

From Shigeki Kobayashi <shigeki.kobayas...@g.softbank.co.jp>
Subject Re: Process behavior of executing multiple jobs
Date Tue, 20 Nov 2012 07:53:04 GMT
Hi Karl.

Thanks for your information. That was very informative.

I will let you know when I see long-term behavior that looks obviously
strange.

Regards,


Shigeki

2012/11/19 Karl Wright <daddywri@gmail.com>

> Hi Shigeki,
>
> This is a complex question, which is actually at the center of what
> ManifoldCF does.
>
> There are two different kinds of scheduling that MCF does.  The first
> is scheduling documents within a single connection.  The second is
> scheduling documents across connections.
>
> Let's start with the first.  Every connector, given a document, has
> the ability to determine what throttling "bins" it belongs in.  A
> throttling bin is an arbitrary grouping of documents that should be
> treated together for the purposes of throttling.  For example, the web
> connector uses a document's server name as a throttling bin, which
> means that any new document from the same server will be rate-limited
> relative to other documents from that server.  This grouping allows
> the ManifoldCF document queue to be "prioritized" (which means that a
> priority number is set) in such a way that documents from all bins
> have an equal probability of being scheduled in a given time interval.
> Then, the query that finds the next set of documents to crawl can do
> mostly the right thing if it just orders the query based on the
> priority number.
>
> The second layer adjusts for differences in performance between bins
> and between connections.  ManifoldCF keeps track of the performance
> statistics of each connector and each throttle bin.  If the statistics
> show that processing a document for one bin in one connector is
> significantly slower than for the others, it will take that into
> account and learn to give fewer documents from that bin or connection
> to the worker threads during any given time interval.
>
> If the statistics change, it will obviously be a little while before
> ManifoldCF adjusts its behavior.  But eventually it should adjust.
>
> If you are seeing a specific long-term behavior that is not optimal,
> please let us know.  It's been quite a while since anyone has had
> questions/issues in this area.
>
> Thanks,
> Karl
>
> On Sun, Nov 18, 2012 at 10:55 PM, Shigeki Kobayashi
> <shigeki.kobayashi3@g.softbank.co.jp> wrote:
> >
> > Hi.
> >
> > I have a question about how MCF behaves when executing multiple jobs.
> >
> > I run MCF 1.0 on Tomcat, crawl files on Windows file servers, and index
> > them into Solr 3.6.
> >
> > When I set up multiple jobs and execute them at the same time, I notice
> > that the number of documents processed by each job is uneven. For
> > example, while one job has processed 100 documents, the other job has
> > processed only 5. In the end, all of the jobs complete, but I wonder
> > how the jobs could process documents evenly at the same time.
> > I also wonder how MCF determines the crawling and indexing priority of
> > each document in each job.
> >
> >
> > Regards,
> >
> >
> > Shigeki
>
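
The two layers Karl describes can be illustrated with a small sketch. This is
not ManifoldCF's actual code or algorithm; it is a toy model under stated
assumptions. The function names (assign_priorities, next_batch), the
bin_cost weights, and the example URLs are all hypothetical. It assumes the
bin is the document's server name (as the web connector is described as
doing), that a lower priority number is crawled first, and that a slower bin
is penalized by a simple multiplicative cost.

```python
from collections import defaultdict
from urllib.parse import urlparse


def server_bin(url):
    """Hypothetical bin function: group documents by server name."""
    return urlparse(url).hostname


def assign_priorities(docs, bin_of=server_bin, bin_cost=None):
    """Return {doc: priority}; a lower number means 'crawl sooner'.

    Layer 1: the k-th document seen in a bin gets base priority k, so
    every bin contributes documents at the same rate.
    Layer 2 (assumed form): a per-bin cost factor stretches the spacing
    for bins whose documents are slow to process.
    """
    bin_cost = bin_cost or {}
    seen = defaultdict(int)
    priorities = {}
    for doc in docs:
        b = bin_of(doc)
        seen[b] += 1
        priorities[doc] = seen[b] * bin_cost.get(b, 1.0)
    return priorities


def next_batch(priorities, n):
    """Pick the next n documents by ordering on the priority number."""
    return sorted(priorities, key=lambda d: (priorities[d], d))[:n]


if __name__ == "__main__":
    docs = [f"http://a.example/{i}" for i in range(1, 4)]
    docs += [f"http://b.example/{i}" for i in range(1, 4)]

    # Equal costs: the batch alternates between the two servers.
    print(next_batch(assign_priorities(docs), 4))

    # Server a is assumed to be three times slower: it gets fewer slots.
    slow = assign_priorities(docs, bin_cost={"a.example": 3.0})
    print(next_batch(slow, 4))
```

With equal costs, each batch draws evenly from both servers. Once one bin is
marked as slower, it receives a smaller share of each batch, which is the
kind of uneven per-job document count described in the original question.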
