manifoldcf-dev mailing list archives

From "Aeham Abushwashi (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CONNECTORS-1153) Documents crawled using manifoldcf 1.6 or earlier are needlessly recrawled after upgrade to 1.7 or later
Date Wed, 28 Jan 2015 16:49:35 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295377#comment-14295377 ]

Aeham Abushwashi edited comment on CONNECTORS-1153 at 1/28/15 4:49 PM:
-----------------------------------------------------------------------

Thanks Karl.

For #2, the code I'm looking at (in dev_1x branch) is:
{code}
        if (needToReindex == false)
        {
          needToReindex = (!oldDocumentVersion.equals(newDocumentVersion) ||
            !oldParameterVersion.equals(newParameterVersion) ||
            !oldOutputVersion.equals(fullSpec.getStageDescriptionString(outputStage)) ||
            !oldAuthorityName.equals((newAuthorityNameString==null)?"":newAuthorityNameString));
        }
{code}

and the declaration of the getStageDescriptionString method is as follows:
{code}
public interface IPipelineSpecification extends IPipelineConnections
{
  public static final String _rcsid = "@(#)$Id: IPipelineSpecification.java 1644404 2014-12-10 13:42:00Z kwright $";

  /** Get the description string for a pipeline stage.
  *@param stage is the stage to get the connection name for.
  *@return the description string that stage.
  */
  public VersionContext getStageDescriptionString(int stage);
  
}
{code}
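
A minimal sketch of why the comparison quoted above would always request a reindex. The `VersionContext` class here is a hypothetical stand-in that models only the `getVersionString` accessor named in the issue; it is not the real ManifoldCF class, and the method names `outputChangedAsQuoted` and `outputChangedByString` are made up for illustration.

```java
public class OutputVersionComparison {

    /** Hypothetical stand-in for ManifoldCF's VersionContext (assumption: only this accessor matters here). */
    static final class VersionContext {
        private final String versionString;

        VersionContext(String versionString) {
            this.versionString = versionString;
        }

        String getVersionString() {
            return versionString;
        }
    }

    /** Mirrors the shape of the quoted check: a String is compared to a VersionContext object. */
    static boolean outputChangedAsQuoted(String oldOutputVersion, VersionContext current) {
        // String.equals(Object) returns false for any argument that is not a String,
        // so the negation is true regardless of what the two versions contain.
        return !oldOutputVersion.equals(current);
    }

    /** Compares like with like by extracting the version string first. */
    static boolean outputChangedByString(String oldOutputVersion, VersionContext current) {
        return !oldOutputVersion.equals(current.getVersionString());
    }

    public static void main(String[] args) {
        VersionContext current = new VersionContext("output-v1");
        System.out.println(outputChangedAsQuoted("output-v1", current));  // true, even though nothing changed
        System.out.println(outputChangedByString("output-v1", current));  // false
        System.out.println(outputChangedByString("output-v0", current));  // true
    }
}
```

Because the first form compiles without error, the mismatch is easy to miss, which is why a method name that says it returns a description string while actually returning a `VersionContext` invites this kind of slip.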



> Documents crawled using manifoldcf 1.6 or earlier are needlessly recrawled after upgrade to 1.7 or later
> --------------------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1153
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1153
>             Project: ManifoldCF
>          Issue Type: Bug
>    Affects Versions: ManifoldCF 1.7, ManifoldCF 1.8
>            Reporter: Aeham Abushwashi
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.8.1, ManifoldCF 2.0.1, ManifoldCF 1.9, ManifoldCF 2.1
>
>
> After upgrading to mcf 1.7 or later, pre-existing documents are recrawled and re-indexed even if they have not changed in any way since their last pre-upgrade crawl. The impact can be significant for large manifold deployments with millions+ static documents.
> There appear to be three contributing factors:
> 1. The empty transformation version of a legacy document is different from the initial value of "0+0!" - in PipelineObjectWithVersions#buildAddPipeline and IncrementalIngester#checkFetchDocument
> 2. Incorrect comparison of output versions in PipelineObjectWithVersions#buildAddPipeline where oldOutputVersion is compared to a VersionContext object instead of the version string, which can be obtained by calling VersionContext#getVersionString - if IPipelineSpecification#getStageDescriptionString continues to return a VersionContext object, a rename of the method could be useful
> 3. In PipelineObjectWithVersions#buildAddPipeline, a null value for newAuthorityNameString is not treated the same as an empty string (like it is in other methods)
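
Factors 1 and 3 above both come down to two spellings of "nothing" being compared as if they were different values. The sketch below shows one possible way to normalize them before comparing. It is an illustration under assumptions, not the fix committed to ManifoldCF: the class and method names are hypothetical, and mapping an empty legacy transformation version onto "0+0!" is an assumed reconciliation rule, using only the value quoted in the issue.

```java
public class LegacyVersionNormalization {

    // Initial transformation version value as quoted in the issue description.
    static final String INITIAL_TRANSFORMATION_VERSION = "0+0!";

    /** Treats null and the empty string as the same value. */
    static String nullToEmpty(String value) {
        return value == null ? "" : value;
    }

    /** Assumption: a legacy (null or empty) transformation version is equivalent to the initial value. */
    static String normalizeTransformationVersion(String stored) {
        String value = nullToEmpty(stored);
        return value.isEmpty() ? INITIAL_TRANSFORMATION_VERSION : value;
    }

    /** Authority names differ only if they differ after null/empty normalization. */
    static boolean authorityChanged(String oldAuthorityName, String newAuthorityName) {
        return !nullToEmpty(oldAuthorityName).equals(nullToEmpty(newAuthorityName));
    }

    /** Transformation versions differ only if they differ after legacy normalization. */
    static boolean transformationChanged(String oldVersion, String newVersion) {
        return !normalizeTransformationVersion(oldVersion)
            .equals(normalizeTransformationVersion(newVersion));
    }

    public static void main(String[] args) {
        System.out.println(authorityChanged("", null));               // false
        System.out.println(authorityChanged("", "ldap"));             // true
        System.out.println(transformationChanged("", "0+0!"));        // false
        System.out.println(transformationChanged("0+0!", "1+5!abc")); // true
    }
}
```

The strings "ldap" and "1+5!abc" are arbitrary placeholders chosen only to differ from the normalized values; they do not reflect a real authority name or ManifoldCF's actual version-string encoding.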



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
