manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1122) Explore ways to make job start be faster in systems with lots of documents
Date Mon, 15 Dec 2014 12:35:14 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246612#comment-14246612 ]

Karl Wright commented on CONNECTORS-1122:
-----------------------------------------

The fundamental issue is that document bins are *not* stored in the schema.  Connectors compute
the document bins for a given document in code.  When a job starts, certain documents in the
job's queue are put into a state where their priorities need to be determined.  Similarly,
when a job is aborted, documents that had priorities in that job must have those priorities
rescinded.  In both cases, because document bins are global, the existing allocation of document
priorities suddenly becomes incorrect whenever documents in other jobs have priorities assigned
and share document bins with the documents whose state is changing.  This is why ManifoldCF
currently takes the approach of reprioritizing all documents whenever (say) a job starts or ends.

At job start time, if only the documents being made active for the new job were marked for
prioritization, then any of those documents whose bins overlapped with existing jobs would
be placed at the back of the line.  *No* documents from the overlapping bins would be processed
in the new job until *all* the documents currently prioritized in the older jobs were processed.
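
To make the "back of the line" effect concrete, here is a toy model in Python.  It assumes
each global bin hands out increasing slot numbers and that lower slots are processed first.
The function names, the slot scheme, and the way the full pass interleaves jobs are all
assumptions made for illustration; this is not ManifoldCF's actual priority calculation.

```python
from collections import defaultdict
from itertools import zip_longest


def reprioritize_all(docs):
    """Reassign slots to every document, interleaving jobs within each bin.

    docs: list of (job, doc, bin_name) tuples.
    Returns a dict mapping (job, doc) -> (bin_name, slot).
    """
    per_bin_job = defaultdict(lambda: defaultdict(list))
    for job, doc, bin_name in docs:
        per_bin_job[bin_name][job].append((job, doc))
    result = {}
    for bin_name, jobs in per_bin_job.items():
        slot = 0
        # Take one document from each job in turn (hypothetical fairness rule).
        for row in zip_longest(*jobs.values()):
            for entry in row:
                if entry is not None:
                    result[entry] = (bin_name, slot)
                    slot += 1
    return result


def assign_incrementally(existing, new_docs):
    """Prioritize only the new documents; existing slots are left untouched."""
    next_slot = defaultdict(int)
    for bin_name, slot in existing.values():
        next_slot[bin_name] = max(next_slot[bin_name], slot + 1)
    result = dict(existing)
    for job, doc, bin_name in new_docs:
        result[(job, doc)] = (bin_name, next_slot[bin_name])
        next_slot[bin_name] += 1
    return result


def order(assignments):
    """Documents sorted by slot, lowest (processed first) to highest."""
    return [key for key, _ in sorted(assignments.items(), key=lambda kv: kv[1][1])]


old_docs = [("old", "a", "example.com"), ("old", "b", "example.com"), ("old", "c", "example.com")]
new_docs = [("new", "x", "example.com"), ("new", "y", "example.com")]

incremental = order(assign_incrementally(reprioritize_all(old_docs), new_docs))
full = order(reprioritize_all(old_docs + new_docs))
```

In this model `incremental` lists all three documents of the old job before either document
of the new job, while `full` alternates between the two jobs.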

At job end time, rescinding document priorities suddenly leaves "holes" in the prioritization,
and ManifoldCF's document distribution becomes less efficient.
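
The "holes" can be pictured with the same kind of toy model: if each bin hands out
consecutive slot numbers, removing one job's documents leaves gaps in the sequence.  The
slot scheme and both helper names below are hypothetical, chosen only to illustrate the
idea, and say nothing about how ManifoldCF represents priorities internally.

```python
from collections import defaultdict


def rescind_job(assignments, job):
    """Drop every priority that belongs to `job`; all other slots stay untouched."""
    return {key: value for key, value in assignments.items() if key[0] != job}


def find_holes(assignments):
    """Per bin, list the unused slot numbers below the highest slot still in use."""
    used = defaultdict(set)
    for bin_name, slot in assignments.values():
        used[bin_name].add(slot)
    return {b: sorted(set(range(max(s) + 1)) - s) for b, s in used.items()}


# (job, doc) -> (bin_name, slot), with two jobs interleaved in one bin.
assignments = {
    ("old", "a"): ("example.com", 0),
    ("new", "x"): ("example.com", 1),
    ("old", "b"): ("example.com", 2),
    ("new", "y"): ("example.com", 3),
    ("old", "c"): ("example.com", 4),
}

holes_after_abort = find_holes(rescind_job(assignments, "new"))
```

Here `holes_after_abort` reports slots 1 and 3 of the bin as empty once the "new" job's
priorities are rescinded.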

For the start case, it may be acceptable not to fully reprioritize.  This is one change that
would be easy to explore.  For the job abort case, that is not going to work; the reprioritization
must take place.


> Explore ways to make job start be faster in systems with lots of documents
> --------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1122
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1122
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Framework crawler agent
>    Affects Versions: ManifoldCF 1.8, ManifoldCF 2.0
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.9, ManifoldCF 2.1
>
>
> Job start requires all documents to be marked as needing reprioritization now.  We should
> consider ways in which we can reduce the need to do this as much as possible.  For example,
> if there are NO documents at all for a job, reprioritization is by definition unneeded.
> Alternatively, if we could come up with a way of determining whether there are any bin-level
> overlaps between documents made active by a job start and documents elsewhere, we could be
> more targeted.
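
A rough sketch of the two shortcuts described in the issue, in Python.  The `bins_for`
callable stands in for whatever the connector computes in code, and `host_bins` (one bin per
URL hostname) is purely a hypothetical example of such a function.  Because bins are not
stored in the schema, a real check along these lines would have to compute bins for every
document it examines, which may itself be costly.

```python
from urllib.parse import urlsplit


def host_bins(document_url):
    """Hypothetical bin function: one bin per hostname."""
    return [urlsplit(document_url).hostname]


def needs_full_reprioritization(starting_docs, other_docs, bins_for=host_bins):
    """Return True only if a starting document shares a bin with another job's document."""
    # Shortcut 1: a job with no documents cannot disturb existing priorities.
    if not starting_docs:
        return False
    starting_bins = set()
    for doc in starting_docs:
        starting_bins.update(bins_for(doc))
    # Shortcut 2: without any bin-level overlap, existing priorities stay valid.
    return any(not starting_bins.isdisjoint(bins_for(doc)) for doc in other_docs)


other_job_docs = ["http://a.example.com/1", "http://b.example.com/2"]

overlapping = needs_full_reprioritization(["http://a.example.com/9"], other_job_docs)
disjoint = needs_full_reprioritization(["http://c.example.com/9"], other_job_docs)
empty_job = needs_full_reprioritization([], other_job_docs)
```

With these inputs only the first call finds an overlap; the other two would allow the full
reprioritization to be skipped in this simplified model.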



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
