manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1249) Keep a separate document priority queue per job, and synchronize with any running jobs on job start
Date Sat, 05 Dec 2015 06:54:10 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15042714#comment-15042714 ]

Karl Wright commented on CONNECTORS-1249:
-----------------------------------------

Started (finally) looking at this.

Prioritization is done by bin.  Bins are named, and those names are currently global.  So
all connections that create documents with the same bin name will collide with one another.

Bins are tracked by BinManager.java, with these methods:

{code}
public double[] getIncrementBinValues(String binName, double newBinValue, int count)
{code}

and

{code}
public double[] getIncrementBinValuesInTransaction(String binName, double newBinValue, int count)
{code}

The bin name is limited to 255 characters.

So addressing this requires two actions:
(1) We augment the connection's provided bin name with additional information, such as job
ID, connector name, etc.;
(2) We make sure that all connectors provide a reasonable bin name that is unlikely to collide
from job to job, e.g. the host name of the connection.

For (1), using the job ID is problematic, because bin-based throttling is supposed to prevent
specific machines/services from being overwhelmed, and per-job bins would no longer group
documents bound for the same machine.  But we could use the connector class name
as a distinguishing factor, adding that field to the BinManager as a way of at least segregating
documents by service.
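
As a rough illustration of segregating bins by connector class, here is a minimal sketch that keys a
counter on the pair (connector class name, bin name).  Everything in it is hypothetical: the
ConnectorScopedBins and BinKey classes, the increment method, and the example class and host names
are invented for illustration and are not part of BinManager, which tracks bin values rather than
simple counts.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch, not ManifoldCF code: bins keyed by connector class AND bin name.
class ConnectorScopedBins {
  /** Composite key: connector class name plus the connection-provided bin name. */
  static final class BinKey {
    final String connectorClassName;
    final String binName;

    BinKey(String connectorClassName, String binName) {
      this.connectorClassName = connectorClassName;
      this.binName = binName;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof BinKey)) {
        return false;
      }
      BinKey other = (BinKey) o;
      return connectorClassName.equals(other.connectorClassName)
          && binName.equals(other.binName);
    }

    @Override
    public int hashCode() {
      return Objects.hash(connectorClassName, binName);
    }
  }

  // Stand-in for per-bin state; a plain document count is an assumption made for brevity.
  private final Map<BinKey, Integer> counts = new HashMap<>();

  /** Records one document in the given bin and returns that bin's running count. */
  int increment(String connectorClassName, String binName) {
    return counts.merge(new BinKey(connectorClassName, binName), 1, Integer::sum);
  }

  public static void main(String[] args) {
    ConnectorScopedBins bins = new ConnectorScopedBins();
    // Two documents from one (made-up) connector class share a bin.
    bins.increment("org.example.WebConnector", "intranet.example.com");
    bins.increment("org.example.WebConnector", "intranet.example.com");
    // The same bin name under a different connector class gets its own bin.
    int fileCount = bins.increment("org.example.FileConnector", "intranet.example.com");
    System.out.println(fileCount); // prints 1
  }
}
```

Keeping the connector class as a separate key field, rather than concatenating it into the bin name,
leaves the 255-character bin name limit untouched.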

For (2), we just need to audit the connectors.


> Keep a separate document priority queue per job, and synchronize with any running jobs on job start
> ---------------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1249
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1249
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Framework crawler agent
>    Affects Versions: ManifoldCF 2.2
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 2.3
>
>
> Starting a job when there has already been a long-running job in MCF takes a very long time, because the documents from the new job don't get processed until the other jobs' current backlog at the time the new job was started goes away.
> Effectively, this is because there is only one stream of document priorities, and all jobs tap into that.  But there's no reason why we can't have multiple document priority streams, one per active job, with some redesign work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
