manifoldcf-dev mailing list archives

From "Aeham Abushwashi (JIRA)" <>
Subject [jira] [Commented] (CONNECTORS-1153) Documents crawled using manifoldcf 1.6 or earlier are needlessly recrawled after upgrade to 1.7 or later
Date Thu, 29 Jan 2015 16:49:34 GMT


Aeham Abushwashi commented on CONNECTORS-1153:

Sorry, my last comment was a little cryptic. What I meant was that the following statement:
      if (oldTransformationVersion.equals(newTransformationVersion))
        return true;

should be
      if (!oldTransformationVersion.equals(newTransformationVersion))
        return true;

Separately, in buildAddPipeline, the following check throws an exception for a never-seen-before document, because oldTransformationVersion is null at that point:
        if (oldTransformationVersion.length() == 0)
          oldTransformationVersion = "0+0!";

The fix is to check for null first:
        if (oldTransformationVersion == null || oldTransformationVersion.length() == 0)
          oldTransformationVersion = "0+0!";
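As a minimal, standalone sketch of that null-safe check (not the actual ManifoldCF code): the class name TransformationVersionSketch and the helper normalize are hypothetical, and only the "0+0!" initial value and the null/empty test come from the snippet above.

```java
// Hypothetical standalone sketch; not part of ManifoldCF.
final class TransformationVersionSketch {

    // Initial transformation version used in the snippet above.
    static final String INITIAL_VERSION = "0+0!";

    // Hypothetical helper: treat a missing (null) or empty old version
    // the same way, by mapping both to the initial value.
    static String normalize(String oldTransformationVersion) {
        // The null test must come first; calling length() on null would
        // throw a NullPointerException.
        if (oldTransformationVersion == null || oldTransformationVersion.length() == 0) {
            return INITIAL_VERSION;
        }
        return oldTransformationVersion;
    }

    public static void main(String[] args) {
        System.out.println(normalize(null));   // never-seen-before document
        System.out.println(normalize(""));     // legacy document with empty version
        System.out.println(normalize("1+2!")); // illustrative non-empty value, left unchanged
    }
}
```

Putting the normalization in one helper would let every caller apply the same rule, but that is a design suggestion rather than something the patch is known to do.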

> Documents crawled using manifoldcf 1.6 or earlier are needlessly recrawled after upgrade to 1.7 or later
> --------------------------------------------------------------------------------------------------------
>                 Key: CONNECTORS-1153
>                 URL:
>             Project: ManifoldCF
>          Issue Type: Bug
>    Affects Versions: ManifoldCF 1.7, ManifoldCF 1.8
>            Reporter: Aeham Abushwashi
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.8.1, ManifoldCF 2.0.1, ManifoldCF 1.9, ManifoldCF 2.1
> After upgrading to mcf 1.7 or later, pre-existing documents are recrawled and re-indexed even if they have not changed in any way since their last pre-upgrade crawl. The impact can be significant for large ManifoldCF deployments with millions of static documents.
> There appear to be three contributing factors:
> 1. The empty transformation version of a legacy document is different from the initial value of "0+0!" - in PipelineObjectWithVersions#buildAddPipeline and IncrementalIngester#checkFetchDocument
> 2. Incorrect comparison of output versions in PipelineObjectWithVersions#buildAddPipeline, where oldOutputVersion is compared to a VersionContext object instead of the version string, which can be obtained by calling VersionContext#getVersionString. If IPipelineSpecification#getStageDescriptionString continues to return a VersionContext object, a rename of the method could be useful.
> 3. In PipelineObjectWithVersions#buildAddPipeline, a null value for newAuthorityNameString is not treated the same as an empty string (like it is in other methods)
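
Factors 2 and 3 above can be illustrated with a small standalone sketch. It does not use the ManifoldCF classes: VersionContextStandIn is a hypothetical stand-in for VersionContext that exposes only getVersionString, and the method names sameOutputVersionBuggy, sameOutputVersion, and sameAuthorityName are made up for illustration.

```java
// Hypothetical standalone sketch; not part of ManifoldCF.
final class VersionComparisonSketch {

    // Stand-in for VersionContext; only getVersionString is taken from the issue text.
    static final class VersionContextStandIn {
        private final String versionString;

        VersionContextStandIn(String versionString) {
            this.versionString = versionString;
        }

        String getVersionString() {
            return versionString;
        }
    }

    // Factor 2, buggy form: String.equals returns false for any non-String
    // argument, so the versions never compare as equal.
    static boolean sameOutputVersionBuggy(String oldOutputVersion, VersionContextStandIn newVersion) {
        return oldOutputVersion.equals(newVersion);
    }

    // Factor 2, corrected form: compare against the version string itself.
    static boolean sameOutputVersion(String oldOutputVersion, VersionContextStandIn newVersion) {
        return oldOutputVersion.equals(newVersion.getVersionString());
    }

    // Factor 3: treat null and "" as the same authority name before comparing.
    static boolean sameAuthorityName(String oldName, String newName) {
        String a = (oldName == null) ? "" : oldName;
        String b = (newName == null) ? "" : newName;
        return a.equals(b);
    }

    public static void main(String[] args) {
        VersionContextStandIn current = new VersionContextStandIn("v1");
        System.out.println(sameOutputVersionBuggy("v1", current));
        System.out.println(sameOutputVersion("v1", current));
        System.out.println(sameAuthorityName("", null));
    }
}
```

With the buggy form, an unchanged document always looks changed, which matches the needless recrawl described in the issue.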

This message was sent by Atlassian JIRA
