Date: Wed, 28 Jan 2015 16:28:35 +0000 (UTC)
From: "Karl Wright (JIRA)"
To: dev@manifoldcf.apache.org
Subject: [jira] [Commented] (CONNECTORS-1153) Documents crawled using manifoldcf 1.6 or earlier are needlessly recrawled after upgrade to 1.7 or later

    [ https://issues.apache.org/jira/browse/CONNECTORS-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295352#comment-14295352 ]

Karl Wright commented on CONNECTORS-1153:
-----------------------------------------

For case (1), some explanation. Prior to 1.7 there were no transformation connectors at all, so the legacy value of the transformation version string was effectively undefined. The column was added, but no initial value was ever set. This was in part an oversight; the code to set it would ideally have been part of the upgrade.
Now, there are two choices: (a) I can either add it to the upgrade now, which may not help you, or (b) I can try to determine if it is safe for an empty transformation version read from the database to be considered equivalent to a value of "0+0!". Either fix would only be done for MCF 1.x.

> Documents crawled using manifoldcf 1.6 or earlier are needlessly recrawled after upgrade to 1.7 or later
> --------------------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1153
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1153
>             Project: ManifoldCF
>          Issue Type: Bug
>    Affects Versions: ManifoldCF 1.7, ManifoldCF 1.8
>            Reporter: Aeham Abushwashi
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.8.1, ManifoldCF 2.0.1, ManifoldCF 1.9, ManifoldCF 2.1
>
>
> After upgrading to mcf 1.7 or later, pre-existing documents are recrawled and re-indexed even if they have not changed in any way since their last pre-upgrade crawl. The impact can be significant for large manifold deployments with millions+ static documents.
> There appear to be three contributing factors:
> 1. The empty transformation version of a legacy document is different from the initial value of "0+0!" - in PipelineObjectWithVersions#buildAddPipeline and IncrementalIngester#checkFetchDocument
> 2. Incorrect comparison of output versions in PipelineObjectWithVersions#buildAddPipeline where oldOutputVersion is compared to a VersionContext object instead of the version string, which can be obtained by calling VersionContext#getVersionString - if IPipelineSpecification#getStageDescriptionString continues to return a VersionContext object, a rename of the method could be useful
> 3. In PipelineObjectWithVersions#buildAddPipeline, a null value for newAuthorityNameString is not treated the same as an empty string (like it is in other methods)
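
For illustration only, the three comparisons named in the issue could be sketched in Java as below. This is not the ManifoldCF code and not the committed fix: `LegacyVersionCompare`, its methods, and the `VersionHolder` stand-in for `VersionContext` are all hypothetical names. The first method assumes that an empty or null transformation version may be treated as "0+0!", which is exactly the open question in choice (b) above.

```java
// Hypothetical helper for illustration; none of these names exist in ManifoldCF.
final class LegacyVersionCompare {
    // Initial transformation version value quoted in the issue.
    static final String INITIAL_TRANSFORMATION_VERSION = "0+0!";

    // Factor 1. Assumption: a null or empty stored value means "no transformations",
    // so it is mapped to the initial value before comparing.
    static String normalizeTransformationVersion(String stored) {
        return (stored == null || stored.isEmpty()) ? INITIAL_TRANSFORMATION_VERSION : stored;
    }

    static boolean sameTransformationVersion(String stored, String current) {
        return normalizeTransformationVersion(stored)
            .equals(normalizeTransformationVersion(current));
    }

    // Minimal stand-in for a VersionContext-like wrapper around a version string.
    static final class VersionHolder {
        private final String versionString;
        VersionHolder(String versionString) { this.versionString = versionString; }
        String getVersionString() { return versionString; }
    }

    // Factor 2. Compare the old string against the wrapped version string,
    // not against the wrapper object itself.
    static boolean sameOutputVersion(String oldOutputVersion, VersionHolder newVersion) {
        return oldOutputVersion != null
            && oldOutputVersion.equals(newVersion.getVersionString());
    }

    // Factor 3. Treat a null authority name the same as an empty string.
    static boolean sameAuthorityName(String oldName, String newName) {
        String a = (oldName == null) ? "" : oldName;
        String b = (newName == null) ? "" : newName;
        return a.equals(b);
    }

    public static void main(String[] args) {
        VersionHolder current = new VersionHolder("v1");
        Object wrapper = current;
        // Comparing a String to the wrapper object never matches.
        System.out.println("string vs wrapper: " + "v1".equals(wrapper));
        System.out.println("string vs string:  " + sameOutputVersion("v1", current));
        System.out.println("empty vs initial:  " + sameTransformationVersion("", "0+0!"));
        System.out.println("null vs empty:     " + sameAuthorityName(null, ""));
    }
}
```

If any of these comparisons reports a difference for an unchanged document, the document would be treated as changed, which matches the needless recrawl described in the issue.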