manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Process behavior of executing multiple jobs
Date Mon, 19 Nov 2012 09:43:43 GMT
Hi Shigeki,

This is a complex question, which is actually at the center of what
ManifoldCF does.

There are two different kinds of scheduling that MCF does.  The first
is scheduling documents within a single connection.  The second is
scheduling documents across connections.

Let's start with the first.  Every connector, given a document, has
the ability to determine which throttling "bins" it belongs to.  A
throttling bin is an arbitrary grouping of documents that should be
treated together for the purposes of throttling.  For example, the web
connector uses a document's server name as a throttling bin, which
means that any new document from the same server will be rate-limited
relative to other documents from that server.  This grouping allows
the ManifoldCF document queue to be "prioritized" (which means that a
priority number is set) in such a way that documents from all bins
have an equal probability of being scheduled in a given time interval.
Then, the query that finds the next set of documents to crawl can do
mostly the right thing if it just orders its results by the
priority number.
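
To make the idea concrete, here is a minimal sketch of bin-based
prioritization.  It is not ManifoldCF's actual code: the names bin_for,
assign_priorities, and next_documents are hypothetical, and the rule
"the n-th document seen in a bin gets priority n" is a simplifying
assumption standing in for the real priority calculation.

```python
from collections import defaultdict
from urllib.parse import urlparse


def bin_for(url):
    """Hypothetical bin function: like the web connector, use the server name."""
    return urlparse(url).hostname


def assign_priorities(urls):
    """Give the n-th document seen in each bin the priority n (assumed rule).

    Lower numbers are crawled first, so ordering by priority interleaves bins.
    """
    seen_per_bin = defaultdict(int)
    priorities = {}
    for url in urls:
        b = bin_for(url)
        priorities[url] = seen_per_bin[b]
        seen_per_bin[b] += 1
    return priorities


def next_documents(priorities, limit):
    """Stand-in for the queue query: order by priority number, take the first few."""
    return sorted(priorities, key=lambda u: (priorities[u], u))[:limit]


urls = [
    "http://a.example/1", "http://a.example/2", "http://a.example/3",
    "http://a.example/4", "http://b.example/1", "http://b.example/2",
]
batch = next_documents(assign_priorities(urls), 4)
```

Even though server a.example has twice as many documents queued, the
first batch of four contains two documents from each server, because the
ordering alone spreads the work across bins.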

The second layer adjusts for differences in performance between bins
and between connections.  ManifoldCF keeps track of the performance
statistics of each connector and each throttle bin.  If the statistics
show that processing a document for one bin in one connector is
significantly slower than for the others, it will take that into
account and learn to give fewer documents from that bin or connection
to the worker threads during any given time interval.
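
As a rough illustration of that adjustment, the sketch below keeps a
running average of processing time per bin and hands out fewer documents
from the slower bin.  Everything in it is an assumption made for the
example: the BinStats class, the smoothing factor alpha, and the rule
that every bin gets an equal share of the time interval are hypothetical
and are not taken from the ManifoldCF source.

```python
class BinStats:
    """Running average of per-document processing time for one bin (hypothetical)."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha        # smoothing factor: how fast new samples take effect
        self.avg_seconds = None   # no samples recorded yet

    def record(self, seconds):
        """Fold one observed processing time into the running average."""
        if self.avg_seconds is None:
            self.avg_seconds = seconds
        else:
            self.avg_seconds += self.alpha * (seconds - self.avg_seconds)


def documents_per_interval(stats, interval_seconds):
    """Assumed rule: each bin gets an equal time share; slower bins fit fewer documents."""
    share = interval_seconds / len(stats)
    return {name: int(share / s.avg_seconds) for name, s in stats.items()}


fast, slow = BinStats(), BinStats()
for _ in range(5):
    fast.record(0.5)   # half a second per document
    slow.record(5.0)   # five seconds per document

allocation = documents_per_interval({"fast": fast, "slow": slow}, 60)
```

With these numbers the fast bin is given 60 documents per minute and the
slow bin only 6.  Because the average is smoothed, a single quick
document from the slow bin moves its allocation only slightly, so the
scheduler in this sketch reacts to a change in statistics gradually
rather than at once.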

If the statistics change, it will obviously be a little while before
ManifoldCF adjusts its behavior.  But eventually it should adjust.

If you are seeing a specific long-term behavior that is not optimal,
please let us know.  It's been quite a while since anyone has had
questions/issues in this area.

Thanks,
Karl

On Sun, Nov 18, 2012 at 10:55 PM, Shigeki Kobayashi
<shigeki.kobayashi3@g.softbank.co.jp> wrote:
>
> Hi.
>
> I have a question about how MCF behaves when executing multiple jobs.
>
> I run MCF 1.0 on Tomcat, crawl files on Windows file servers, and index them
> into Solr 3.6.
>
> When I set up multiple jobs and execute them at the same time, I notice that
> the number of documents processed is skewed toward one job over another.
> For example, while one job has processed 100 documents, the other job has
> processed only 5. In the end, all of the jobs complete, but I wonder how
> those jobs can be made to process documents evenly at the same time.
> I also wonder how MCF determines the crawl and index priority of each
> document in each job.
>
>
> Regards,
>
>
> Shigeki
