manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1153) Documents crawled using manifoldcf 1.6 or earlier are needlessly recrawled after upgrade to 1.7 or later
Date Wed, 28 Jan 2015 16:28:35 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295352#comment-14295352 ]

Karl Wright commented on CONNECTORS-1153:
-----------------------------------------

For case (1), some explanation.  Prior to 1.7 there were no transformation connectors at all,
so the legacy value of the transformation version string was effectively undefined.  The column
was added, but no initial value was ever set.  This was in part an oversight; the code to
set it would ideally have been part of the upgrade.

Now there are two choices: (a) I can add it to the upgrade now, which may not help
you, or (b) I can try to determine whether it is safe for an empty transformation version
read from the database to be considered equivalent to a value of "0+0!".  Either fix would
only be done for MCF 1.x.
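
To make option (b) concrete, here is a minimal sketch of treating an empty stored value as equivalent to "0+0!" at comparison time. The class and method names are hypothetical and are not taken from the ManifoldCF code base; the sketch also assumes the equivalence turns out to be safe, which is exactly what still needs to be determined.

```java
// Hypothetical helper, not part of ManifoldCF: illustrates option (b) only.
final class TransformationVersions {
    // Value quoted in the issue as the initial transformation version.
    static final String INITIAL_VERSION = "0+0!";

    private TransformationVersions() {}

    // Map a missing or empty stored value to the initial version string.
    static String normalize(String stored) {
        return (stored == null || stored.isEmpty()) ? INITIAL_VERSION : stored;
    }

    // Compare the value read from the database with the freshly computed one,
    // so that a legacy empty value no longer looks like a change.
    static boolean sameVersion(String storedVersion, String currentVersion) {
        return normalize(storedVersion).equals(normalize(currentVersion));
    }
}
```

With this in place, a legacy row holding an empty string compares equal to a current version of "0+0!", while any other pair of differing strings still counts as a change.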

> Documents crawled using manifoldcf 1.6 or earlier are needlessly recrawled after upgrade to 1.7 or later
> --------------------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1153
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1153
>             Project: ManifoldCF
>          Issue Type: Bug
>    Affects Versions: ManifoldCF 1.7, ManifoldCF 1.8
>            Reporter: Aeham Abushwashi
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.8.1, ManifoldCF 2.0.1, ManifoldCF 1.9, ManifoldCF 2.1
>
>
> After upgrading to mcf 1.7 or later, pre-existing documents are recrawled and re-indexed
> even if they have not changed in any way since their last pre-upgrade crawl. The impact can
> be significant for large manifold deployments with millions+ static documents.
> There appear to be three contributing factors:
> 1. The empty transformation version of a legacy document is different from the initial
>    value of "0+0!" - in PipelineObjectWithVersions#buildAddPipeline and
>    IncrementalIngester#checkFetchDocument
> 2. Incorrect comparison of output versions in PipelineObjectWithVersions#buildAddPipeline
>    where oldOutputVersion is compared to a VersionContext object instead of the version string,
>    which can be obtained by calling VersionContext#getVersionString - if
>    IPipelineSpecification#getStageDescriptionString continues to return a VersionContext
>    object, a rename of the method could be useful
> 3. In PipelineObjectWithVersions#buildAddPipeline, a null value for newAuthorityNameString
>    is not treated the same as an empty string (like it is in other methods)
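
Factors 2 and 3 above can be illustrated with a small sketch. Everything here is a hypothetical stand-in written for illustration: `VersionContextSketch` is not the real ManifoldCF `VersionContext` class (only the `getVersionString` accessor name is taken from the issue text), and the helper methods do not reproduce the actual `buildAddPipeline` logic.

```java
// Hypothetical stand-in for a version-carrying object; not the ManifoldCF class.
final class VersionContextSketch {
    private final String versionString;

    VersionContextSketch(String versionString) {
        this.versionString = versionString;
    }

    String getVersionString() {
        return versionString;
    }
}

// Hypothetical helpers showing the comparison pitfalls described in the issue.
final class VersionComparisons {
    private VersionComparisons() {}

    // Factor 2, buggy form: String.equals(Object) is false for any non-String
    // argument, so the stored version never matches and the document is re-indexed.
    static boolean outputVersionUnchangedBuggy(String oldOutputVersion, VersionContextSketch current) {
        return oldOutputVersion.equals(current);
    }

    // Factor 2, corrected form: compare against the version string itself.
    static boolean outputVersionUnchangedFixed(String oldOutputVersion, VersionContextSketch current) {
        return oldOutputVersion.equals(current.getVersionString());
    }

    // Factor 3: treat a null authority name the same as an empty string.
    static boolean sameAuthorityName(String oldName, String newName) {
        String a = (oldName == null) ? "" : oldName;
        String b = (newName == null) ? "" : newName;
        return a.equals(b);
    }
}
```

The buggy comparison compiles without error because `equals` accepts any `Object`, which is likely why a mistake of this kind can go unnoticed until documents start being recrawled.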



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
