Date: Thu, 20 Nov 2014 14:55:34 +0000 (UTC)
From: "Karl Wright (JIRA)"
To: dev@manifoldcf.apache.org
Reply-To: dev@manifoldcf.apache.org
Subject: [jira] [Commented] (CONNECTORS-1100) Improve ManifoldCF scalability by adopting dynamic reprioritization

    [ https://issues.apache.org/jira/browse/CONNECTORS-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219445#comment-14219445 ]

Karl Wright commented on CONNECTORS-1100:
-----------------------------------------

Also, r1640749 (trunk), r1640751 (dev_1x)

> Improve ManifoldCF scalability by adopting dynamic reprioritization
> -------------------------------------------------------------------
>
> Key: CONNECTORS-1100
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1100
> Project: ManifoldCF
> Issue Type: Improvement
> Components: Framework crawler agent
> Affects Versions: ManifoldCF 1.8, ManifoldCF 2.0
> Reporter: Karl Wright
> Assignee: Karl Wright
> Fix For: ManifoldCF 1.8, ManifoldCF 2.0
>
>
> The single greatest impediment to MCF scalability at this point is the time that is required to reprioritize documents. This can be improved by adopting dynamic reprioritization.
> This process involves the following changes to the reprioritization process:
> - At the time of reprioritization, wipe the docpriority field to its null value. (Q: How long does this take?)
> - A background thread picks up documents on the queue with null priorities, and prioritizes them based on current conditions, doing 1000 or 10000 at a time.
> - We need a new field to properly manage this: a simple boolean field which I'll call "needpriority", and an index based on it too.
> - ReprioritizationTracker currently has a notion of "reprioritization cycles". Cycles are interesting insofar as the minimum depth gets reset at the beginning of them. We will need to redefine what a cycle is, if we wish to maintain the current logic. Specifically, a "cycle" looks like this:
> (1) Remove all current document priorities (set to nullDocumentPriority), while also setting the "needpriority" field to "true"
> (2) Reset priority calculation values (e.g. minimum depth)
> (3) Done with *first part* of cycle
> (4) Reprioritize via thread over time
> (5) When no more to do, done with *second part* of cycle
> - But the whole "priorityset" logic is not needed anymore if there's a persistent thread and a "needpriority" index. So we will only need the first part of the cycle, above, and no priorityset field is needed anymore at all. We just need a global write lock to coordinate the priority-setting threads cross-cluster. These also have to synchronize, though, with any "reprioritization" cycles.
> - We can simply change the contract of IReprioritizationTracker to include entrance and exit for the reprioritization threads, and manage any locking internally. This would be PROVIDED the threads exercised good cleanup hygiene.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
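
For illustration only, the proposed cycle can be sketched as a small in-memory model. This is not ManifoldCF code: the class and method names (QueuedDoc, ReprioritizationSketch, start_cycle, reprioritization_pass, prioritize_batch, drain) are hypothetical, a Python dict stands in for the job queue table, a threading.Lock stands in for the global cross-cluster write lock, and the priority function is a placeholder. Only "docpriority", "needpriority", the null priority, the minimum-depth reset, and the batch sizes come from the issue text.

```python
import threading
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional

NULL_PRIORITY = None  # stand-in for nullDocumentPriority


@dataclass
class QueuedDoc:
    doc_id: str
    docpriority: Optional[float] = NULL_PRIORITY
    needpriority: bool = False  # the proposed boolean field


class ReprioritizationSketch:
    """Hypothetical in-memory model of the queue plus tracker state."""

    def __init__(self, docs: Iterable[QueuedDoc],
                 priority_fn: Callable[[QueuedDoc], float] = lambda d: 1.0):
        self.docs: Dict[str, QueuedDoc] = {d.doc_id: d for d in docs}
        self.priority_fn = priority_fn   # placeholder for the real calculation
        self.min_depth = 0.0             # example "priority calculation value"
        self._lock = threading.Lock()    # stands in for the global write lock

    def start_cycle(self) -> None:
        """First part of a cycle: steps (1) and (2) from the issue."""
        with self._lock:
            for d in self.docs.values():
                d.docpriority = NULL_PRIORITY
                d.needpriority = True
            self.min_depth = 0.0

    @contextmanager
    def reprioritization_pass(self):
        """Entrance/exit for a prioritizing thread; locking stays internal."""
        self._lock.acquire()          # entrance
        try:
            yield
        finally:
            self._lock.release()      # exit runs even on error (cleanup hygiene)

    def prioritize_batch(self, batch_size: int = 1000) -> int:
        """One background pass: prioritize up to batch_size flagged documents."""
        with self.reprioritization_pass():
            # A real table would use the "needpriority" index instead of a scan.
            batch = [d for d in self.docs.values() if d.needpriority][:batch_size]
            for d in batch:
                d.docpriority = self.priority_fn(d)
                d.needpriority = False
            return len(batch)

    def drain(self, batch_size: int = 1000) -> int:
        """Repeat passes until nothing is flagged; returns documents handled."""
        total = 0
        while True:
            handled = self.prioritize_batch(batch_size)
            if handled == 0:
                return total
            total += handled
```

In this sketch a reprioritization request only pays for start_cycle; the per-document work happens later in batches, and because start_cycle and each pass take the same lock, a new cycle cannot interleave with a batch in progress.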