manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1100) Improve ManifoldCF scalability by adopting dynamic reprioritization
Date Mon, 10 Nov 2014 13:14:34 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14204746#comment-14204746 ]

Karl Wright commented on CONNECTORS-1100:
-----------------------------------------

Working on this design; committed a good chunk of it.  What's left:

- ReprioritizationTracker needs to be revised to handle the current reprioritization model
- We need a way of preventing reprioritizations from overwriting null document priorities
if the needPriority field is no longer set by the time the priority is assigned
- I've wired up the "blocking documents" logic to reprioritize any document that is
encountered in the stuffer query but cannot be returned for throttling reasons.  This
logic will cause massive amounts of ongoing reprioritization: when the queue is nearly
empty and queuing is gated by throttling, every document encountered will be
reprioritized.  That may be acceptable, but previously we had a 10-minute window of
prioritization stability, which is no longer present.
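
One way to picture the guard in the second item is a conditional write that only assigns a priority while the flag is still set. The following is a minimal in-memory sketch in Java, not ManifoldCF code: QueueRow, GuardedPriorityWriter, markQueued, and the NULL_PRIORITY value are all hypothetical names chosen for illustration, and a real implementation would presumably put the same condition into the database update itself.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for one job queue row; not ManifoldCF's actual schema.
final class QueueRow {
    static final double NULL_PRIORITY = 1e9; // assumed sentinel for "no priority"
    double docPriority = NULL_PRIORITY;
    boolean needPriority = true;
}

// Hypothetical helper showing a guarded (compare-then-write) priority assignment.
final class GuardedPriorityWriter {
    private final Map<String, QueueRow> rows = new HashMap<>();

    synchronized void add(String docId) {
        rows.put(docId, new QueueRow());
    }

    // Simulates another thread taking the document off the queue:
    // the priority is nulled and the document no longer needs one.
    synchronized void markQueued(String docId) {
        QueueRow row = rows.get(docId);
        if (row != null) {
            row.docPriority = QueueRow.NULL_PRIORITY;
            row.needPriority = false;
        }
    }

    // Assigns the priority only if needPriority is still set. In SQL terms this
    // corresponds to adding "AND needpriority = true" to the UPDATE's WHERE clause.
    synchronized boolean assignIfStillNeeded(String docId, double priority) {
        QueueRow row = rows.get(docId);
        if (row == null || !row.needPriority) {
            return false; // flag was cleared meanwhile; leave the null priority alone
        }
        row.docPriority = priority;
        row.needPriority = false;
        return true;
    }

    synchronized double priorityOf(String docId) {
        return rows.get(docId).docPriority;
    }
}
```

With this shape, a priority computed for a document whose flag was cleared in the meantime is simply dropped rather than written over the null value.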

> Improve ManifoldCF scalability by adopting dynamic reprioritization
> -------------------------------------------------------------------
>
>                 Key: CONNECTORS-1100
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1100
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Framework crawler agent
>    Affects Versions: ManifoldCF 1.8, ManifoldCF 2.0
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.8, ManifoldCF 2.0
>
>
> The single greatest impediment to MCF scalability at this point is the time that is required
to reprioritize documents.  This can be improved by adopting dynamic reprioritization.
> This process involves the following changes to the reprioritization process:
> - At the time of reprioritization, wipe the docpriority field to its null value. (Q:
How long does this take?)
> - A background thread picks up documents on the queue with null priorities, and prioritizes
them based on current conditions, doing 1000 or 10000 at a time
> - We need a new field to properly manage this; a simple boolean field which I'll call
"needpriority", and an index based on it too.
> - ReprioritizationTracker currently has a notion of "reprioritization cycles".  Cycles
are interesting insofar as the minimum depth gets reset at the beginning of them.  We will
need to redefine what a cycle is, if we wish to maintain the current logic.  Specifically,
a "cycle" looks like this:
> (1) Remove all current document priorities (set to nullDocumentPriority), while also
setting "needpriority" field to "true"
> (2) Reset priority calculation values (e.g. minimum depth)
> (3) Done with *first part* of cycle
> (4) Reprioritize via thread over time
> (5) When no more to do, done with *second part* of cycle
> - But the whole "priorityset" logic is not needed anymore if there's a persistent thread
and a "needpriority" index.  So we will need only the first part of the cycle, above, and
the priorityset field is no longer needed at all.  We just need a global write lock to coordinate
the priority-setting threads cross-cluster.  These also have to synchronize, though, with
any "reprioritization" cycles.
> - We can simply change the contract of IReprioritizationTracker to include entrance and
exit for the reprioritization threads, and manage any locking internally.  This would work
provided the threads exercise good cleanup hygiene.
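
The cycle and the entrance/exit contract described above can be sketched roughly as follows. This is a minimal single-JVM illustration in Java under stated assumptions, not the actual IReprioritizationTracker: SketchTracker and all of its method names are hypothetical, a ReentrantLock stands in for the cross-cluster write lock, and the priority formula is a placeholder.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch; none of these names are ManifoldCF's real classes.
final class SketchTracker {
    static final double NULL_PRIORITY = Double.MAX_VALUE; // assumed sentinel value

    private static final class Doc {
        final int depth;
        double docPriority = NULL_PRIORITY;
        boolean needPriority = true;
        Doc(int depth) { this.depth = depth; }
    }

    private final Map<String, Doc> queue = new LinkedHashMap<>();
    // In-JVM stand-in for the global cross-cluster write lock.
    private final ReentrantLock lock = new ReentrantLock();
    private int minimumDepth = Integer.MAX_VALUE;

    void addDocument(String id, int depth) {
        lock.lock();
        try { queue.put(id, new Doc(depth)); } finally { lock.unlock(); }
    }

    // Steps (1) and (2): wipe priorities, set needpriority, reset calculation state.
    void startCycle() {
        lock.lock();
        try {
            for (Doc d : queue.values()) {
                d.docPriority = NULL_PRIORITY;
                d.needPriority = true;
            }
            minimumDepth = Integer.MAX_VALUE;
        } finally { lock.unlock(); }
    }

    // Entrance and exit for a prioritization thread; locking stays internal.
    void enter() { lock.lock(); }
    void exit() { lock.unlock(); }

    // Prioritizes up to batchSize documents that still need a priority.
    int prioritizeBatch(int batchSize) {
        if (!lock.isHeldByCurrentThread()) {
            throw new IllegalStateException("call enter() first");
        }
        int done = 0;
        for (Doc d : queue.values()) {
            if (done == batchSize) break;
            if (!d.needPriority) continue;
            minimumDepth = Math.min(minimumDepth, d.depth);
            d.docPriority = d.depth - minimumDepth; // placeholder formula
            d.needPriority = false;
            done++;
        }
        return done;
    }

    // One pass of the background thread, with exit() guaranteed by finally.
    int runWorkerOnce(int batchSize) {
        enter();
        try { return prioritizeBatch(batchSize); } finally { exit(); }
    }

    int pendingCount() {
        lock.lock();
        try {
            int n = 0;
            for (Doc d : queue.values()) if (d.needPriority) n++;
            return n;
        } finally { lock.unlock(); }
    }
}
```

A background thread would call runWorkerOnce in a loop with a batch size such as 1000 and idle when it returns 0. Taking the lock per batch rather than per drain lets a new cycle start between batches, and the try/finally is the "cleanup hygiene" the contract depends on.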



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)